deepseek-moe-16B pretraining issue #7165

Open
1 task done
zyp-byte opened this issue Mar 5, 2025 · 2 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

zyp-byte commented Mar 5, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.10.0-1.0.0.26-x86_64-with-glibc2.27
  • Python version: 3.10.14
  • PyTorch version: 2.3.0 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.15.4

Reproduction

I want to do a full pretrain of the deepseek-moe-16B model, with the following parameters:

### model
model_name_or_path: model/deepseek-moe-16b-base
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/deepseek-moe/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000

But it reports the following error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/tuner.py", line 57, in run_exp
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 195, in backward
[rank0]:     self.engine.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2095, in step
[rank0]:     self._optimizer_step(sub_group_id)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 979, in _optimizer_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1712608839953/work/c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
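As a side note, a minimal diagnostic sketch (not part of the original report) can confirm whether the NVML library named in the traceback actually exports the missing symbol. The library path below is copied verbatim from the error message; the ctypes attribute access performs the same kind of symbol lookup that fails inside PyTorch:

import ctypes

# Path taken from the error message above; it points at a GPU-proxy NVML stub.
lib = ctypes.CDLL("/home/opt/gpuproxy/lib64/libnvidia-ml.so.1")
try:
    lib.nvmlDeviceGetNvLinkRemoteDeviceType  # attribute access triggers the symbol lookup
    print("symbol found")
except AttributeError:
    print("symbol missing: the proxy NVML library is older than what this PyTorch build expects")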

In the same environment, a full pretrain of the Qwen2.5-0.5B model runs without any problem; the parameters were:

### model
model_name_or_path: model/Qwen2.5-0.5B
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/Qwen2.5-0.5B/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000

So I'm wondering whether there is something wrong with my DeepSpeed configuration, or whether I'm missing some other step.
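One way to narrow this down, as a hypothetical check that is not part of the thread, is to run a single AdamW step on a plain CUDA tensor outside of LLaMA-Factory and DeepSpeed. The traceback fails while allocating optimizer state via torch.zeros_like, so if the snippet below also fails, the problem lies in the PyTorch/driver environment rather than in the training config (it may not exercise the exact same allocator path as ZeRO-3, though):

import torch

# One parameter and one AdamW step; the optimizer state (exp_avg, exp_avg_sq)
# is allocated with torch.zeros_like inside step(), the call that fails above.
param = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))
optimizer = torch.optim.AdamW([param], lr=1e-4)
loss = (param ** 2).sum()
loss.backward()
optimizer.step()
print("plain AdamW step succeeded")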

Others

No response

zyp-byte added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on Mar 5, 2025

zyp-byte (Author) commented Mar 5, 2025

I tried Qwen1.5-MoE-A2.7B and got the same error. If llamafactory supports full pretraining of MoE models, how should I configure my parameters?

hiyouga (Owner) commented Mar 5, 2025

It looks like a machine issue.
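If it is indeed an environment problem, one hypothetical follow-up (assuming the nvidia-ml-py / pynvml package is available; this is not from the thread) is to probe NVLink queries through the same NVML library directly and see whether they fail on this machine:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# A800 (A100-class) GPUs expose up to 12 NVLink links.
for link in range(12):
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        print(f"link {link}: state={state}")
    except pynvml.NVMLError as err:
        print(f"link {link}: {err}")
pynvml.nvmlShutdown()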
