deepseek-moe-16B pretraining issue #7165

Open
1 task done
zyp-byte opened this issue Mar 5, 2025 · 2 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

zyp-byte commented Mar 5, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.10.0-1.0.0.26-x86_64-with-glibc2.27
  • Python version: 3.10.14
  • PyTorch version: 2.3.0 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.15.4

Reproduction

I want to do a full pretrain of the deepseek-moe-16B model, with the following parameters:

### model
model_name_or_path: model/deepseek-moe-16b-base
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/deepseek-moe/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000

But it reports the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/tuner.py", line 57, in run_exp
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 195, in backward
[rank0]:     self.engine.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2095, in step
[rank0]:     self._optimizer_step(sub_group_id)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 979, in _optimizer_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1712608839953/work/c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
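As a side note, a minimal diagnostic sketch (not part of the original report) can confirm whether the NVML library named in the traceback actually exports the missing symbol. The library path below is copied verbatim from the error message; the ctypes attribute access performs the same kind of symbol lookup that fails inside PyTorch:

import ctypes

# Path taken from the error message above; it points at a GPU-proxy NVML stub.
lib = ctypes.CDLL("/home/opt/gpuproxy/lib64/libnvidia-ml.so.1")
try:
    lib.nvmlDeviceGetNvLinkRemoteDeviceType  # attribute access triggers the symbol lookup
    print("symbol found")
except AttributeError:
    print("symbol missing: the proxy NVML library is older than what this PyTorch build expects")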

In the same environment, a full pretrain of the Qwen2.5-0.5B model runs without any problem; the parameters were:

### model
model_name_or_path: model/Qwen2.5-0.5B
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/Qwen2.5-0.5B/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000

So I'm wondering whether there is something wrong with my DeepSpeed configuration, or whether I'm missing some other step.
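One way to narrow this down, as a hypothetical check that is not part of the thread, is to run a single AdamW step on a plain CUDA tensor outside of LLaMA-Factory and DeepSpeed. The traceback fails while allocating optimizer state via torch.zeros_like, so if the snippet below also fails, the problem lies in the PyTorch/driver environment rather than in the training config (it may not exercise the exact same allocator path as ZeRO-3, though):

import torch

# One parameter and one AdamW step; the optimizer state (exp_avg, exp_avg_sq)
# is allocated with torch.zeros_like inside step(), the call that fails above.
param = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))
optimizer = torch.optim.AdamW([param], lr=1e-4)
loss = (param ** 2).sum()
loss.backward()
optimizer.step()
print("plain AdamW step succeeded")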

Others

No response

zyp-byte added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Mar 5, 2025

zyp-byte (Author) commented Mar 5, 2025

I tried Qwen1.5-MoE-A2.7B and got the same error. If llamafactory supports full pretraining of MoE models, how should I configure my parameters?

hiyouga (Owner) commented Mar 5, 2025

It looks like a machine issue.
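If it is indeed an environment problem, one hypothetical follow-up (assuming the nvidia-ml-py / pynvml package is available; this is not from the thread) is to probe NVLink queries through the same NVML library directly and see whether they fail on this machine:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# A800 (A100-class) GPUs expose up to 12 NVLink links.
for link in range(12):
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        print(f"link {link}: state={state}")
    except pynvml.NVMLError as err:
        print(f"link {link}: {err}")
pynvml.nvmlShutdown()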
