
Single-node multi-GPU (4 x 3090) on Linux: error when running the default llamafactory-cli train /homeqwen3b_lora_pretrain.yaml #7233

Johnnythefool opened this issue Mar 10, 2025 · 4 comments
Labels
bug (Something isn't working), pending (This problem is yet to be addressed)

Comments

@Johnnythefool

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
  • Python version: 3.10.16
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.49.0
  • Datasets version: 3.2.0
  • Accelerate version: 1.2.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 3090
  • GPU number: 4
  • GPU memory: 23.68GB
  • DeepSpeed version: 0.16.4

Reproduction

(llm_demo) (base) root@f58359457528:/home/guohuanjun/LLaMA-Factory# llamafactory-cli train /home/guohuanjun/LLaMA-Factory/examples/train_lora/qwen3b_lora_pretrain.yaml
[2025-03-10 13:04:56,808] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|2025-03-10 13:05:00] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:23914
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] 
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] *****************************************
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] *****************************************
[2025-03-10 13:05:07,089] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,098] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,100] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,101] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING|2025-03-10 13:05:08] llamafactory.hparams.parser:162 >> `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:697] 2025-03-10 13:05:08,633 >> loading configuration file /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-03-10 13:05:08,634 >> Model config Qwen2Config {
  "_name_or_path": "/home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file chat_template.jinja
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2313] 2025-03-10 13:05:08,955 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-03-10 13:05:08,956 >> loading configuration file /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-03-10 13:05:08,957 >> Model config Qwen2Config {
  "_name_or_path": "/home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}

[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-10 13:05:09,274 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|2025-03-10 13:05:09] llamafactory.data.template:162 >> `template` was not specified, try parsing the chat template from the tokenizer.
[INFO|2025-03-10 13:05:09] llamafactory.data.loader:157 >> Loading dataset /home/guohuanjun/LLaMA-Factory/data/rank_00080.json...
[rank2]:[W310 13:05:09.586915779 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank3]:[W310 13:05:09.607403385 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W310 13:05:09.607456036 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Converting format of dataset (num_proc=16): 100%|████████████| 1000/1000 [00:00<00:00, 5434.28 examples/s]
[rank0]:[W310 13:05:10.773668125 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank2]:     launch()
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank2]:     run_exp()
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank2]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank2]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank2]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank2]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank2]:     with training_args.main_process_first(desc="load dataset"):
[rank2]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank2]:     return next(self.gen)
[rank2]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank2]:     dist.barrier()
[rank2]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank2]:     work = group.barrier(opts=opts)
[rank2]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank2]: Last error:
[rank2]: Error while creating shared memory segment /dev/shm/nccl-h93xZ8 (size 9637888)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank1]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank1]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank1]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank1]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank1]:     with training_args.main_process_first(desc="load dataset"):
[rank1]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank1]:     return next(self.gen)
[rank1]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank1]:     dist.barrier()
[rank1]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank1]:     work = group.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank1]: Last error:
[rank1]: Error while creating shared memory segment /dev/shm/nccl-LeXlEJ (size 9637888)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank3]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank3]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank3]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank3]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank3]:     with training_args.main_process_first(desc="load dataset"):
[rank3]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank3]:     return next(self.gen)
[rank3]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank3]:     dist.barrier()
[rank3]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank3]:     work = group.barrier(opts=opts)
[rank3]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank3]: Last error:
[rank3]: Error while creating shared memory segment /dev/shm/nccl-glXHZ7 (size 9637888)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank0]:     dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank0]:   File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank0]:     with training_args.main_process_first(desc="load dataset"):
[rank0]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 142, in __exit__
[rank0]:     next(self.gen)
[rank0]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2505, in main_process_first
[rank0]:     dist.barrier()
[rank0]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank0]:     work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-VYfIAl (size 9637888)
[rank0]:[W310 13:05:11.651204983 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0310 13:05:12.285000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958758 closing signal SIGTERM
W0310 13:05:12.286000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958759 closing signal SIGTERM
W0310 13:05:12.286000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958761 closing signal SIGTERM
E0310 13:05:12.350000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 2958760) of binary: /root/miniconda3/envs/llm_demo/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/llm_demo/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-10_13:05:12
  host      : f58359457528
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2958760)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Others

No response

@Johnnythefool added the bug and pending labels on Mar 10, 2025
@Johnnythefool (Author)

Sorry to bother you, but could this be caused by a missing library? Or does distributed training require DeepSpeed or a similar method? Thanks!

@chenxinxi

Wow, this is exactly the same as my problem: single-GPU training through the webui works, but multi-GPU fails. Have you solved it?

@Johnnythefool (Author)

Wow, this is exactly the same as my problem: single-GPU training through the webui works, but multi-GPU fails. Have you solved it?

Bro, try adding ddp_backend: gloo to your yaml file; that should fix the error. However, although the GPUs can now train at the same time, GPU memory usage and utilization only reach about one third, and I'm still looking for a fix (a sketch of the change is shown below).
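A minimal sketch of the suggested workaround, assuming a typical LLaMA-Factory LoRA pretraining config; every key except ddp_backend is an illustrative placeholder and should match whatever is already in qwen3b_lora_pretrain.yaml:

    # qwen3b_lora_pretrain.yaml (illustrative excerpt; keep your existing keys as-is)
    model_name_or_path: /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct
    stage: pt
    do_train: true
    finetuning_type: lora
    # Workaround from the comment above: switch the DDP communication backend
    # from NCCL to gloo so the failing NCCL shared-memory setup is bypassed.
    ddp_backend: gloo

Note that gloo is generally slower than NCCL for GPU collectives, which may be related to the lower utilization mentioned above.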

@chenxinxi

[Image: screenshot]
Bro, here's the problem I'm hitting now: after NCCL connects successfully, it just hangs... I don't know how to fix it and am still looking for a solution.
Command: llamafactory-cli train /home/xiaosong/CXB-file/test1/LLaMA-Factory/examples/train_lora/llama3_lora_pretrain.yaml
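A small sketch of one way to gather more information, following the hint in the error output above ("run with NCCL_DEBUG=INFO for details"); the path is taken from the command in the previous line:

    # Re-run the same command with NCCL debug logging enabled so the NCCL
    # initialization and transport selection are printed before the hang.
    NCCL_DEBUG=INFO llamafactory-cli train /home/xiaosong/CXB-file/test1/LLaMA-Factory/examples/train_lora/llama3_lora_pretrain.yaml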
