unsloth occasionally produces a loss of 0 #7268

Open
1 task done
EntropyYue opened this issue Mar 12, 2025 · 2 comments
Labels
bug (Something isn't working) · pending (This problem is yet to be addressed)

Comments

EntropyYue commented Mar 12, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Python version: 3.10.16
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.48.2
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 2080 Ti
  • GPU number: 1
  • GPU memory: 22.00GB
  • Bitsandbytes version: 0.45.1

Reproduction

llamafactory-cli train `
    --stage sft `
    --do_train True `
    --model_name_or_path H:\models\xxx `
    --preprocessing_num_workers 16 `
    --finetuning_type lora `
    --template qwen `
    --flash_attn auto `
    --use_unsloth True `
    --dataset_dir data `
    --dataset xxx `
    --cutoff_len 512 `
    --learning_rate 5e-05 `
    --num_train_epochs 2.0 `
    --max_samples 100000 `
    --per_device_train_batch_size 8 `
    --gradient_accumulation_steps 2 `
    --lr_scheduler_type cosine `
    --max_grad_norm 1.0 `
    --logging_steps 5 `
    --save_steps 100 `
    --warmup_steps 0 `
    --packing False `
    --report_to none `
    --use_swanlab True `
    --output_dir saves\Qwen2.5-14B-Instruct\lora\xxx `
    --fp16 True `
    --plot_loss True `
    --trust_remote_code True `
    --ddp_timeout 180000000 `
    --include_num_input_tokens_seen True `
    --optim adamw_torch `
    --adapter_name_or_path saves\Qwen2.5-14B-Instruct\lora\xxx `
    --quantization_bit 4 `
    --quantization_method bitsandbytes `
    --double_quantization True `
    --lora_rank 8 `
    --lora_alpha 16 `
    --lora_dropout 0.1 `
    --loraplus_lr_ratio 16 `
    --lora_target all `
    --swanlab_project Scarlett `
    --swanlab_run_name xxx `
    --swanlab_mode local `
    --val_size 0.1 `
    --eval_strategy steps `
    --eval_steps 100 `
    --per_device_eval_batch_size 8
 97%|██████████████████████████████████████████████████████████████████████████  | 1905/1956 [7:12:54<30:49, 36.26s/it][INFO|2025-03-12 13:45:21] llamafactory.train.callbacks:157 >> {'loss': 0.0000, 'learning_rate': 3.9427e-05, 'epoch': 1.95, 'throughput': 273.54}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.95, 'num_input_tokens_seen': 7105680}
 98%|██████████████████████████████████████████████████████████████████████████▏ | 1910/1956 [7:13:41<10:45, 14.03s/it][INFO|2025-03-12 13:46:09] llamafactory.train.callbacks:157 >> {'loss': 0.3393, 'learning_rate': 3.9427e-05, 'epoch': 1.95, 'throughput': 273.72}
{'loss': 0.3393, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.95, 'num_input_tokens_seen': 7123280}
 98%|██████████████████████████████████████████████████████████████████████████▍ | 1915/1956 [7:14:29<06:57, 10.17s/it][INFO|2025-03-12 13:46:56] llamafactory.train.callbacks:157 >> {'loss': 0.0923, 'learning_rate': 3.9427e-05, 'epoch': 1.96, 'throughput': 273.90}
{'loss': 0.0923, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.96, 'num_input_tokens_seen': 7140880}
 98%|██████████████████████████████████████████████████████████████████████████▌ | 1920/1956 [7:15:16<05:45,  9.61s/it][INFO|2025-03-12 13:47:43] llamafactory.train.callbacks:157 >> {'loss': 0.0000, 'learning_rate': 3.9427e-05, 'epoch': 1.96, 'throughput': 274.08}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.96, 'num_input_tokens_seen': 7158544}
 98%|██████████████████████████████████████████████████████████████████████████▊ | 1925/1956 [7:16:06<05:06,  9.90s/it][INFO|2025-03-12 13:48:33] llamafactory.train.callbacks:157 >> {'loss': 0.1628, 'learning_rate': 3.9427e-05, 'epoch': 1.97, 'throughput': 274.27}
{'loss': 0.1628, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.97, 'num_input_tokens_seen': 7177360}
 99%|██████████████████████████████████████████████████████████████████████████▉ | 1930/1956 [7:16:56<04:18,  9.94s/it][INFO|2025-03-12 13:49:23] llamafactory.train.callbacks:157 >> {'loss': 0.0000, 'learning_rate': 3.9427e-05, 'epoch': 1.97, 'throughput': 274.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.9426990162541475e-05, 'epoch': 1.97, 'num_input_tokens_seen': 7195920}

Others

When fine-tuning with unsloth (4-bit QLoRA via bitsandbytes), the loss sometimes becomes 0.
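Not part of LLaMA-Factory itself, but as a minimal sketch for anyone trying to catch this while reproducing: a Transformers TrainerCallback that warns whenever a logged step reports a loss of exactly 0 or a NaN grad_norm, which is the pattern in the log above. The class name ZeroLossWatcher is hypothetical, and it would have to be registered in a custom training script (the llamafactory-cli command alone does not add callbacks).

import math

from transformers import TrainerCallback


class ZeroLossWatcher(TrainerCallback):
    # Hypothetical helper: inspects the dict passed to on_log (the same values
    # printed in the console log, e.g. "loss" and "grad_norm") and warns on the
    # zero-loss / NaN-gradient pattern shown above.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        loss = logs.get("loss")
        grad_norm = logs.get("grad_norm")
        if loss == 0.0:
            print(f"[step {state.global_step}] reported loss is exactly 0")
        if isinstance(grad_norm, float) and math.isnan(grad_norm):
            print(f"[step {state.global_step}] grad_norm is NaN")

It would be passed as callbacks=[ZeroLossWatcher()] when constructing the Trainer.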

EntropyYue added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Mar 12, 2025
hiyouga (Owner) commented Mar 12, 2025

Try disabling unsloth.
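(In the reproduction command above, that means passing --use_unsloth False, or simply dropping the flag, while keeping the rest of the QLoRA settings unchanged.)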

EntropyYue (Author) commented
It works normally after disabling it.
