Adding Llama 1B and 3B model. #1040
Conversation
Would you please include the following in the PR description?
- source of truth
- verified model size from terminal output for each new config
@tianyu-l, the source is HuggingFace as mentioned above. I am seeing TorchTitan output as follows:
1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model llama3 3B size: 4,399,475,712 total parameters
1B: INFO - Model llama3 1B size: 1,397,819,392 total parameters
3B: INFO - Model llama3 3B size: 4,399,475,712 total parameters
I don't think they look right -- they are not close to 1B and 3B. I believe you have to tune the ffn_dim_multiplier arg to yield the right intermediate_size. Please see my inline comments.
n_layers=16,
n_heads=32,
n_kv_heads=8,
ffn_dim_multiplier=1.3,
The HF model definition includes intermediate_size directly, whereas torchtitan uses ffn_dim_multiplier to infer it. Changing
ffn_dim_multiplier=1.3,
to
ffn_dim_multiplier=1.4,
gives the right intermediate_size=8192 and 982,386,688 total parameters.
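For context, here is a minimal sketch of how the Llama-style FFN hidden size is derived from dim, ffn_dim_multiplier, and multiple_of (it mirrors the Llama reference computation; the multiple_of=1024 value below is an assumption matching the existing Llama 3 8B entry, and is what makes 1.4 land on 8192 for dim=2048):

```python
# Minimal sketch of the Llama-style FFN hidden-size computation.
# Variable names are illustrative; multiple_of=1024 is an assumption here.
def ffn_hidden_dim(dim: int, ffn_dim_multiplier: float | None, multiple_of: int) -> int:
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # round up to the nearest multiple of `multiple_of`
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

print(ffn_hidden_dim(2048, 1.3, 1024))  # 7168 -- too small for the 1B config
print(ffn_hidden_dim(2048, 1.4, 1024))  # 8192 -- matches the HF intermediate_size
```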
n_layers=28,
n_heads=24,
n_kv_heads=8,
ffn_dim_multiplier=1.3,
Similarly, deleting
ffn_dim_multiplier=1.3,
gives intermediate_size=8192 and 2,832,608,256 total parameters.
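Applying the same sketch to the 3B config (dim=3072, again assuming multiple_of=1024) shows why no multiplier is needed:

```python
# 2/3 * 4 * 3072 = 8192, already a multiple of 1024, so no scaling or rounding applies.
print(ffn_hidden_dim(3072, None, 1024))  # 8192 -- matches the HF intermediate_size
```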
This page shows a different set of parameter counts -- 1.23B and 3.21B for the 1B and 3B models respectively. Interesting numbers!
update: I found the issue -- the numbers I gave were from using a test tokenizer with a smaller vocab size, which makes the embedding/output modules small. After switching to the official tokenizer, which has vocab size 128256, the numbers match perfectly:
- 1B: 1.23B parameters
- 3B: 3.21B parameters
The caveat is that this assumes we do weight tying on the embedding and output modules and define the weight only once. Currently torchtitan does not support this -- it will take some work on the parallelism side which I haven't tried.
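As a rough check of that caveat, the gap between the tied and untied counts is exactly one vocab_size x dim matrix. A small sketch, assuming the HF Llama-3.2 hidden sizes of 2048 (1B) and 3072 (3B):

```python
# Parameters saved by tying tok_embeddings and output into a single weight.
vocab_size = 128256  # official Llama 3 tokenizer vocab size, per the thread above
for name, dim in [("1B", 2048), ("3B", 3072)]:
    shared = vocab_size * dim
    print(f"{name}: weight tying saves {shared:,} parameters")
# 1B: weight tying saves 262,668,288 parameters
# 3B: weight tying saves 394,002,432 parameters
# Without tying, torchtitan counts both matrices, so its totals come out higher
# than the 1.23B / 3.21B reported on the HF model pages.
```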
Based on the HuggingFace models (https://huggingface.co/meta-llama/Llama-3.2-1B and https://huggingface.co/meta-llama/Llama-3.2-3B), this PR adds 1B and 3B flavors to the llama3 model configs.
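A rough sketch of what the two new entries might look like (not the literal PR diff): n_layers, n_heads, and n_kv_heads come from the review snippets above, dim and the 1.4 multiplier follow the discussion, and multiple_of / rope_theta are assumptions carried over from the existing Llama 3 entries:

```python
# Hypothetical sketch of the 1B and 3B config entries; values marked "assumed"
# are not confirmed by this thread.
llama3_new_flavors = {
    "1B": dict(
        dim=2048,                # HF Llama-3.2-1B hidden size
        n_layers=16,
        n_heads=32,
        n_kv_heads=8,
        ffn_dim_multiplier=1.4,  # per review, yields intermediate_size=8192
        multiple_of=1024,        # assumed, matching the 8B entry
        rope_theta=500000,       # assumed
    ),
    "3B": dict(
        dim=3072,                # HF Llama-3.2-3B hidden size
        n_layers=28,
        n_heads=24,
        n_kv_heads=8,
        # no ffn_dim_multiplier: 2/3 * 4 * 3072 = 8192 already matches HF
        multiple_of=1024,        # assumed
        rope_theta=500000,       # assumed
    ),
}
```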