System Info

llamafactory version: 0.9.2.dev0

Reproduction
(llm_demo) (base) root@f58359457528:/home/guohuanjun/LLaMA-Factory# llamafactory-cli train /home/guohuanjun/LLaMA-Factory/examples/train_lora/qwen3b_lora_pretrain.yaml
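For context, a minimal sketch of what a LoRA pretraining config like this typically contains. The actual qwen3b_lora_pretrain.yaml was not posted; the keys below follow LLaMA-Factory's bundled examples (examples/train_lora/*_lora_pretrain.yaml), and the values are illustrative placeholders, not the reporter's settings:

```bash
# Hypothetical config, reconstructed for reference only -- not the reporter's file.
cat > examples/train_lora/qwen3b_lora_pretrain.yaml <<'EOF'
model_name_or_path: /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct
stage: pt                      # continued pretraining
do_train: true
finetuning_type: lora
lora_target: all
dataset: my_pretrain_corpus    # placeholder; must be registered in data/dataset_info.json
cutoff_len: 2048
output_dir: saves/qwen2.5-3b/lora/pretrain
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
EOF
```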
[2025-03-10 13:04:56,808] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|2025-03-10 13:05:00] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:23914
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792]
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] *****************************************
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0310 13:05:01.749000 2958665 site-packages/torch/distributed/run.py:792] *****************************************
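This warning is informational: torchrun defaults OMP_NUM_THREADS to 1 per worker to avoid oversubscribing the CPU. If preprocessing or data loading is CPU-bound, it can be raised explicitly before launching, for example:

```bash
# Optional tuning: give each of the 4 workers more CPU threads
# (pick a value around physical_cores / nproc_per_node).
export OMP_NUM_THREADS=4
llamafactory-cli train /home/guohuanjun/LLaMA-Factory/examples/train_lora/qwen3b_lora_pretrain.yaml
```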
[2025-03-10 13:05:07,089] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,098] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,100] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-10 13:05:07,101] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING|2025-03-10 13:05:08] llamafactory.hparams.parser:162 >> `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:697] 2025-03-10 13:05:08,633 >> loading configuration file /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-03-10 13:05:08,634 >> Model config Qwen2Config {
"_name_or_path": "/home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,636 >> loading file chat_template.jinja
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2025-03-10 13:05:08] llamafactory.hparams.parser:384 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2313] 2025-03-10 13:05:08,955 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:697] 2025-03-10 13:05:08,956 >> loading configuration file /home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct/config.json
[INFO|configuration_utils.py:771] 2025-03-10 13:05:08,957 >> Model config Qwen2Config {
"_name_or_path": "/home/guohuanjun/LLaMA-Factory/Qwen2.5-3B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.49.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2048] 2025-03-10 13:05:08,957 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2313] 2025-03-10 13:05:09,274 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|2025-03-10 13:05:09] llamafactory.data.template:162 >> `template` was not specified, try parsing the chat template from the tokenizer.
[INFO|2025-03-10 13:05:09] llamafactory.data.loader:157 >> Loading dataset /home/guohuanjun/LLaMA-Factory/data/rank_00080.json...
[rank2]:[W310 13:05:09.586915779 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank3]:[W310 13:05:09.607403385 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank1]:[W310 13:05:09.607456036 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
Converting format of dataset (num_proc=16): 100%|████████████| 1000/1000 [00:00<00:00, 5434.28 examples/s]
[rank0]:[W310 13:05:10.773668125 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
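The four ProcessGroupNCCL barrier warnings above are benign here: each rank is in fact using its matching GPU, and the warning only notes that the rank-to-device mapping was not passed explicitly to barrier() or init_process_group(). The actual failure starts below.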
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank2]: launch()
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank2]: run_exp()
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank2]: _training_function(config={"args": args, "callbacks": callbacks})
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank2]: run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank2]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank2]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank2]: with training_args.main_process_first(desc="load dataset"):
[rank2]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank2]: return next(self.gen)
[rank2]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank2]: dist.barrier()
[rank2]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank2]: work = group.barrier(opts=opts)
[rank2]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank2]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank2]: Last error:
[rank2]: Error while creating shared memory segment /dev/shm/nccl-h93xZ8 (size 9637888)
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank1]: run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank1]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank1]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank1]: with training_args.main_process_first(desc="load dataset"):
[rank1]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank1]: return next(self.gen)
[rank1]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank1]: dist.barrier()
[rank1]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank1]: Last error:
[rank1]: Error while creating shared memory segment /dev/shm/nccl-LeXlEJ (size 9637888)
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]: launch()
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]: run_exp()
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank3]: _training_function(config={"args": args, "callbacks": callbacks})
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank3]: run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank3]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank3]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank3]: with training_args.main_process_first(desc="load dataset"):
[rank3]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 135, in __enter__
[rank3]: return next(self.gen)
[rank3]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2496, in main_process_first
[rank3]: dist.barrier()
[rank3]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank3]: work = group.barrier(opts=opts)
[rank3]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank3]: Last error:
[rank3]: Error while creating shared memory segment /dev/shm/nccl-glXHZ7 (size 9637888)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 100, in run_exp
[rank0]: _training_function(config={"args": args, "callbacks": callbacks})
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank0]: run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/train/pt/workflow.py", line 46, in run_pt
[rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="pt", **tokenizer_module)
[rank0]: File "/home/guohuanjun/LLaMA-Factory/src/llamafactory/data/loader.py", line 318, in get_dataset
[rank0]: with training_args.main_process_first(desc="load dataset"):
[rank0]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/contextlib.py", line 142, in __exit__
[rank0]: next(self.gen)
[rank0]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/transformers/training_args.py", line 2505, in main_process_first
[rank0]: dist.barrier()
[rank0]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank0]: Last error:
[rank0]: Error while creating shared memory segment /dev/shm/nccl-VYfIAl (size 9637888)
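All four ranks fail identically inside dist.barrier(): NCCL cannot create a roughly 9.6 MB shared memory segment under /dev/shm. The hostname (f58359457528) suggests this is running inside a Docker container, in which case the most likely cause is that the container's /dev/shm is too small — Docker's default is only 64 MB, which is not enough for NCCL's SHM transport across 4 GPUs. A sketch of the usual checks and fixes, assuming the container was started with docker run:

```bash
# 1. Confirm the suspicion: check shared memory available inside the container.
df -h /dev/shm

# 2. Preferred fix: recreate the container with a larger /dev/shm, or share the
#    host IPC namespace (flags are illustrative; keep your other run options).
docker run --shm-size=16g ...
docker run --ipc=host ...

# 3. Workaround without recreating the container: disable NCCL's shared-memory
#    transport (communication falls back to other transports and may be slower).
export NCCL_SHM_DISABLE=1

# 4. For a detailed failure trace, rerun with NCCL debug logging, as the error
#    message itself suggests:
NCCL_DEBUG=INFO llamafactory-cli train /home/guohuanjun/LLaMA-Factory/examples/train_lora/qwen3b_lora_pretrain.yaml
```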
[rank0]:[W310 13:05:11.651204983 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0310 13:05:12.285000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958758 closing signal SIGTERM
W0310 13:05:12.286000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958759 closing signal SIGTERM
W0310 13:05:12.286000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2958761 closing signal SIGTERM
E0310 13:05:12.350000 2958665 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 2958760) of binary: /root/miniconda3/envs/llm_demo/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/llm_demo/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/llm_demo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/guohuanjun/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-10_13:05:12
host : f58359457528
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2958760)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
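The launcher singles out rank 2 only because it was the first child process it observed exiting; all four ranks died on the same shared-memory error shown above.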
Others

No response