[Roadmap] veRL Development Roadmap #22
Comments
Hi @PeterSH6 @eric-haibin-lin, I have been exploring this project recently and would be happy to contribute. Are there any recommended good first issues I could work on to get familiar with it? I am particularly interested in issues related to model parallelism and serving.
Hi @leifeng666, thanks for your interest. One good issue is to support more models with model parallelism / Megatron. Currently we have a Megatron example for deepseek-llm. I think a good path towards broader model support would be:
You can find reference Megatron commands and logs in #210. Another way to verify correctness is simply to compare Megatron tensor-parallel runs against FSDP runs. https://verl.readthedocs.io/en/latest/advance/megatron_extension.html
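One lightweight way to do that comparison is to collect per-token logits from both runs on the same batch and check they agree numerically. A minimal sketch, assuming you have already dumped the logits from each run; the helper names and tolerances here are illustrative, not part of veRL's API:

```python
import numpy as np

def logits_close(ref_logits, test_logits, rtol=1e-3, atol=1e-5):
    """Check that per-token logits from two runs (e.g. an FSDP reference
    run and a Megatron tensor-parallel run) agree within tolerance."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    if ref.shape != test.shape:
        return False
    return bool(np.allclose(ref, test, rtol=rtol, atol=atol))

def greedy_tokens(logits):
    """Greedy-decode token ids from [seq_len, vocab] logits. Comparing
    argmax tokens is a looser check that tolerates small numeric drift
    between the two parallelism strategies."""
    return np.argmax(np.asarray(logits), axis=-1).tolist()
```

Exact logit agreement is not expected across different parallelism layouts, so a relative tolerance (or the looser greedy-token comparison) is usually the practical check.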
Is there a timeline for this? Thanks
Themes
We categorized our roadmap into 8 themes: Broad Model Support, Regular Update, More RL Algorithms Support, Dataset Coverage, Plugin Support, Scaling Up RL, More LLM Infrastructure Support, and Wide Hardware Coverage.
Broad Model Support
To add a new model in veRL, the model should satisfy the following requirements:
- The model runs in vLLM with the `dummy_hf` load format.
- A `dtensor_weight_loader` is implemented for the model to transfer actor weights from the FSDP checkpoint to the vLLM model. See the FSDP document for more information.
- For Megatron support, a `ParallelModel` similar to `modeling_llama_megatron.py` is implemented, together with corresponding `checkpoint_utils` to load checkpoints from Hugging Face and a `megatron_weight_loader` to transfer actor weights from the `ParallelModel` directly to the vLLM model. See the Megatron-LM document for more information.

Regular Update

- Support `position_ids` to enable remove-padding in transformers models (transformers >= v4.45): [misc] feat: spport rmpad/data-packing in FSDP with transformers #91
- Support `resource_pool` (colocate): [misc] fix: weak reference of WorkerDict in RayTrainer #65
- `resource_pool`.

More RL Algorithms Support
Make sure the algorithms can converge on some math datasets (e.g., GSM8k)
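For GSM8k-style data, convergence is commonly tracked with an exact-match reward on the final numeric answer, since GSM8k ground-truth solutions end with `#### <answer>`. A minimal sketch of such a reward function; the helper names are illustrative assumptions, not veRL's actual reward API:

```python
import re

def extract_gsm8k_answer(text):
    """Pull the final numeric answer from a solution string. Ground-truth
    GSM8k solutions end with '#### <number>'; for free-form model outputs
    we fall back to the last number that appears."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m is not None:
        value = m.group(1)
    else:
        nums = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
        if not nums:
            return None
        value = nums[-1]
    return value.replace(",", "")  # normalize thousands separators

def exact_match_reward(model_output, ground_truth_solution):
    """1.0 if the final answers match, else 0.0. Watching the running mean
    of this scalar over training is a simple convergence signal."""
    pred = extract_gsm8k_answer(model_output)
    gold = extract_gsm8k_answer(ground_truth_solution)
    return 1.0 if pred is not None and pred == gold else 0.0
```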
Dataset Coverage
Plugin Support
Scaling up RL
More LLM Infrastructure Support
LLM Training Infrastructure
LLM Serving Infrastructure
At present, our project supports vLLM using the SPMD execution paradigm. This means we've eliminated the need for a standalone single-controller process (known as the `LLMEngine`) by integrating its functionality directly into the multiple worker processes, making the system SPMD.

Wide Hardware Coverage
Supporting a new hardware type in our project involves the following requirements:
- `ray.utils.placement_group` functionality.
- `ParallelModel`

Related issue: Is non-RmPad version model and RmPad verison mdoel interchangeable? #20
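As a small aside on the SPMD serving design described under "More LLM Infrastructure Support" above: the key property is that every worker runs the same program and derives its share of the work from its own rank, so no central driver process is needed. A toy simulation of that idea; the sharding scheme and all names here are made up for illustration and are not vLLM's or veRL's actual code:

```python
# In a single-controller design, one driver (the "LLMEngine") owns the
# schedule and pushes work to passive workers. In an SPMD design, each
# worker computes its own slice of the work from (rank, world_size).

def spmd_worker(rank, world_size, prompts):
    """Every rank executes this identical function; the data it touches
    is a pure function of its rank, which is what makes the system SPMD."""
    local = prompts[rank::world_size]  # static round-robin shard
    return [f"rank{rank} generated for: {p}" for p in local]

def run_spmd(world_size, prompts):
    # In practice these would be separate processes launched together
    # (e.g. via torchrun or Ray); a plain loop stands in for them here.
    outputs = []
    for rank in range(world_size):
        outputs.extend(spmd_worker(rank, world_size, prompts))
    return outputs
```

Because no worker waits on a central scheduler, the generation step composes cleanly with other SPMD components such as FSDP training workers.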