llamafactory version: 0.9.2.dev0
I want to run a full pretrain of the deepseek-moe-16B model with the following parameters:
### model
model_name_or_path: model/deepseek-moe-16b-base
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/deepseek-moe/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000
But it fails with the following error:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/tuner.py", line 57, in run_exp
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/root/paddlejob/workspace/zhangyuanpei/slm_train/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3606, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 195, in backward
[rank0]:     self.engine.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2095, in step
[rank0]:     self._optimizer_step(sub_group_id)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 979, in _optimizer_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: RuntimeError: r.nvmlDeviceGetNvLinkRemoteDeviceType_ INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1712608839953/work/c10/cuda/driver_api.cpp":27, please report a bug to PyTorch. Can't find nvmlDeviceGetNvLinkRemoteDeviceType: /home/opt/gpuproxy/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType
In the same environment, a full pretrain of the qwen2.5-0.5B model runs without any problem; the parameters used then were:
### model
model_name_or_path: model/Qwen2.5-0.5B
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: c4_text
cutoff_len: 2048
overwrite_cache: true
preprocessing_num_workers: 30

### output
output_dir: saves/Qwen2.5-0.5B/full/pretrain
logging_steps: 1000
save_steps: 50000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 50000
So I am wondering whether something is off in my DeepSpeed configuration, or whether I am missing some other step.
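(A minimal sanity check on the DeepSpeed side, added here as an illustration rather than something from the original report: it assumes it is run from the LLaMA-Factory repository root so that the relative config path resolves, and only prints what the referenced JSON actually requests.)

```python
# Illustrative sanity check (not from the original thread): print the ZeRO stage
# and precision settings that examples/deepspeed/ds_z3_config.json requests, to
# rule out an obvious config mix-up before blaming the environment.
import json

with open("examples/deepspeed/ds_z3_config.json") as f:
    ds_cfg = json.load(f)

zero = ds_cfg.get("zero_optimization", {})
print("ZeRO stage:", zero.get("stage"))                 # expected: 3 for ds_z3_config.json
print("bf16 section:", ds_cfg.get("bf16"))              # should not conflict with bf16: true in the YAML
print("optimizer offload:", zero.get("offload_optimizer"))
```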
I also tried Qwen1.5-MoE-A2.7B and hit the same error. If llamafactory supports full pretraining of MoE models, how should I configure my parameters?
This looks like a machine (environment) issue.
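One way to confirm that (a hedged sketch, not from the original thread): check whether the NVML library named in the traceback actually exports the symbol this PyTorch build asserts on.

```python
# Illustrative check (assumption: the library path below is the one reported in
# the traceback). If the attribute lookup fails, the libnvidia-ml.so.1 provided
# by the GPU proxy does not export nvmlDeviceGetNvLinkRemoteDeviceType, which
# points at the driver/NVML stack on the machine rather than at the LLaMA-Factory
# YAML or the DeepSpeed config.
import ctypes

lib = ctypes.CDLL("/home/opt/gpuproxy/lib64/libnvidia-ml.so.1")
try:
    lib.nvmlDeviceGetNvLinkRemoteDeviceType
    print("symbol found: NVML library exports the function PyTorch needs")
except AttributeError:
    print("symbol missing: this libnvidia-ml.so.1 is too old/incomplete for this PyTorch build")
```

If the symbol is missing, the fix is probably on the machine side (a newer driver/NVML library, or pointing the process at a full libnvidia-ml), not in the training parameters.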