
Adding Llama 1B and 3B models #1040

Open · wants to merge 1 commit into base: main

Conversation

@githubsgi (Contributor) commented Apr 1, 2025

Based on the HuggingFace models (https://huggingface.co/meta-llama/Llama-3.2-1B and https://huggingface.co/meta-llama/Llama-3.2-3B).
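
The module trees and configs below were captured from the HuggingFace checkpoints; a rough sketch of how to reproduce them (assuming the transformers and torch packages are installed and access to the gated meta-llama repos has been granted):

# Rough sketch: print the module tree and config for each checkpoint.
# Assumes `transformers`/`torch` are installed and the gated meta-llama
# repos are accessible.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

for repo in ("meta-llama/Llama-3.2-1B", "meta-llama/Llama-3.2-3B"):
    model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)
    print(model)                             # LlamaForCausalLM(...) module tree
    print(AutoConfig.from_pretrained(repo))  # LlamaConfig { ... }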

1B:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "meta-llama/Llama-3.2-1B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.1",
  "use_cache": true,
  "vocab_size": 128256
}

3B:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3072,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3072, out_features=128256, bias=False)
)

LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "meta-llama/Llama-3.2-3B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.1",
  "use_cache": true,
  "vocab_size": 128256
}

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 1, 2025
@tianyu-l (Contributor) left a comment

Would you please include the following in the PR description?

  1. source of truth
  2. verified model size from terminal output for each new config

@githubsgi (Contributor, Author)
@tianyu-l, the source is HuggingFace, as mentioned above. I am seeing the following torchtitan output:

1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model llama3 3B size: 4,399,475,712 total parameters

@tianyu-l (Contributor) left a comment

1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model llama3 3B size: 4,399,475,712 total parameters

I don't think they look right -- they are not close to 1B and 3B.
I believe you have to tune the ffn_dim_multiplier arg to yield the right intermediate_size. Please see my inline comments.

Inline comment context (1B config):

n_layers=16,
n_heads=32,
n_kv_heads=8,
ffn_dim_multiplier=1.3,
@tianyu-l (Contributor):

The HF model definition includes intermediate_size directly, whereas torchtitan uses ffn_dim_multiplier to infer it. Using

Suggested change:
- ffn_dim_multiplier=1.3,
+ ffn_dim_multiplier=1.4,

gives the right intermediate_size=8192 and

982,386,688 total parameters

Inline comment context (3B config):

n_layers=28,
n_heads=24,
n_kv_heads=8,
ffn_dim_multiplier=1.3,
@tianyu-l (Contributor):

Similarly, deleting

Suggested change:
- ffn_dim_multiplier=1.3,

gives intermediate_size=8192 and

2,832,608,256 total parameters
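
For context, torchtitan derives the FFN hidden size from ffn_dim_multiplier following the Llama reference recipe; a minimal sketch of that derivation (multiple_of=1024 is an assumption carried over from the existing llama3 configs) that reproduces the values above:

# Minimal sketch of the Llama-style FFN hidden-size derivation used by
# torchtitan; multiple_of=1024 is an assumption taken from the existing
# llama3 configs.
def ffn_hidden_dim(dim, ffn_dim_multiplier=None, multiple_of=1024):
    hidden_dim = int(2 * (4 * dim) / 3)
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # round up to the nearest multiple of `multiple_of`
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

print(ffn_hidden_dim(2048, 1.3))   # 7168 -- too small for the 1B config
print(ffn_hidden_dim(2048, 1.4))   # 8192 -- matches the HF intermediate_size
print(ffn_hidden_dim(3072))        # 8192 -- matches the 3B with no multiplier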

@githubsgi (Contributor, Author):

This page shows a different set of parameter counts: 1.23B and 3.21B for the 1B and 3B models, respectively. Interesting numbers!

@tianyu-l (Contributor) commented Apr 15, 2025

Update: I found the issue -- the numbers I gave were obtained with a test tokenizer that has a smaller vocab size, which makes the embedding/output modules small. After switching to the official tokenizer, which has vocab size 128256, the numbers match perfectly:

  • 1B: 1.23B parameters
  • 3B: 3.21B parameters

The caveat is that this assumes we do weight tying on the embedding and output modules, defining the weight only once. torchtitan currently does not support this -- it will take some work on the parallelism side which I haven't tried.
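
For reference, a back-of-the-envelope count from the HF configs above reproduces these figures under the weight-tying assumption (a rough sketch, not torchtitan's own reporting):

# Rough parameter count from the HF configs above, assuming the
# embedding and output weights are tied (counted once).
def llama_params(vocab, dim, n_layers, n_heads, n_kv_heads, ffn_dim, tied=True):
    head_dim = dim // n_heads
    attn = 2 * dim * dim + 2 * dim * n_kv_heads * head_dim  # q/o + k/v projections
    mlp = 3 * dim * ffn_dim                                  # gate/up/down projections
    norms = 2 * dim                                          # two RMSNorms per layer
    total = n_layers * (attn + mlp + norms) + dim            # layers + final norm
    total += vocab * dim                                     # embed_tokens
    if not tied:
        total += vocab * dim                                 # separate lm_head
    return total

print(f"{llama_params(128256, 2048, 16, 32, 8, 8192):,}")  # 1,235,814,400 (~1.23B)
print(f"{llama_params(128256, 3072, 28, 24, 8, 8192):,}")  # 3,212,749,824 (~3.21B)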
