[Roadmap] veRL Development Roadmap #22
Comments
Hi @PeterSH6 @eric-haibin-lin, I have been exploring this project recently and would be happy to contribute. Are there any recommended good first issues I could work on to get familiar with it? I am particularly interested in issues related to model parallelism and serving.
Hi @leifeng666, thanks for your interest. One good issue is to support more models with model parallelism / Megatron. Currently we have a Megatron example for deepseek-llm. I think a good path towards broader model support would be:
You can find reference Megatron commands and logs in #210. Another way to verify correctness is simply to compare Megatron tensor-parallel runs against FSDP runs. https://verl.readthedocs.io/en/latest/advance/megatron_extension.html
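One lightweight way to do that comparison is to collect per-token logits from both runs on the same batch and check they agree numerically. A minimal sketch, assuming you have already dumped the logits from each run; the helper names and tolerances here are illustrative, not part of veRL's API:

```python
import numpy as np

def logits_close(ref_logits, test_logits, rtol=1e-3, atol=1e-5):
    """Check that per-token logits from two runs (e.g. an FSDP reference
    run and a Megatron tensor-parallel run) agree within tolerance."""
    ref = np.asarray(ref_logits, dtype=np.float64)
    test = np.asarray(test_logits, dtype=np.float64)
    if ref.shape != test.shape:
        return False
    return bool(np.allclose(ref, test, rtol=rtol, atol=atol))

def greedy_tokens(logits):
    """Greedy-decode token ids from [seq_len, vocab] logits. Comparing
    argmax tokens is a looser check that tolerates small numeric drift
    between the two parallelism strategies."""
    return np.argmax(np.asarray(logits), axis=-1).tolist()
```

Exact logit agreement is not expected across different parallelism layouts, so a relative tolerance (or the looser greedy-token comparison) is usually the practical check.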
Is there a timeline for this? Thanks
Themes
We categorized our roadmap into 8 themes: Broad Model Support, Regular Update, More RL Algorithms Support, Dataset Coverage, Plugin Support, Scaling Up RL, More LLM Infrastructure Support, and Wide Hardware Coverage.
Broad Model Support
To add a new model in veRL, the model should satisfy the following requirements:
- The model runs in vLLM with the `dummy_hf` load format.
- A `dtensor_weight_loader` is implemented for the model to transfer actor weights from the FSDP checkpoint to the vLLM model. See the FSDP document for more information.
- For Megatron support, a `ParallelModel` similar to `modeling_llama_megatron.py` is implemented, together with corresponding `checkpoint_utils` to load checkpoints from Hugging Face and a `megatron_weight_loader` to transfer actor weights from the `ParallelModel` directly to the vLLM model. See the Megatron-LM document for more information.

Regular Update

- Support `position_ids` to enable remove-padding in transformers models (transformers >= v4.45): [misc] feat: spport rmpad/data-packing in FSDP with transformers #91
- Support `resource_pool` (colocate): [misc] fix: weak reference of WorkerDict in RayTrainer #65
- `resource_pool`.

More RL Algorithms Support
Make sure the algorithms can converge on some math datasets (e.g., GSM8k)
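For GSM8k-style data, convergence is commonly tracked with an exact-match reward on the final numeric answer, since GSM8k ground-truth solutions end with `#### <answer>`. A minimal sketch of such a reward function; the helper names are illustrative assumptions, not veRL's actual reward API:

```python
import re

def extract_gsm8k_answer(text):
    """Pull the final numeric answer from a solution string. Ground-truth
    GSM8k solutions end with '#### <number>'; for free-form model outputs
    we fall back to the last number that appears."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m is not None:
        value = m.group(1)
    else:
        nums = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
        if not nums:
            return None
        value = nums[-1]
    return value.replace(",", "")  # normalize thousands separators

def exact_match_reward(model_output, ground_truth_solution):
    """1.0 if the final answers match, else 0.0. Watching the running mean
    of this scalar over training is a simple convergence signal."""
    pred = extract_gsm8k_answer(model_output)
    gold = extract_gsm8k_answer(ground_truth_solution)
    return 1.0 if pred is not None and pred == gold else 0.0
```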
Dataset Coverage
Plugin Support
Scaling up RL
More LLM Infrastructure Support
LLM Training Infrastructure
LLM Serving Infrastructure
At present, our project supports vLLM using the SPMD execution paradigm. This means we've eliminated the need for a standalone single-controller process (known as the `LLMEngine`) by integrating its functionality directly into the multiple worker processes, making the system SPMD.

Wide Hardware Coverage
Supporting a new hardware type in our project involves the following requirements:
- `ray.utils.placement_group` functionality.
- `ParallelModel`

Related issue: Is non-RmPad version model and RmPad verison mdoel interchangeable? #20
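As a small aside on the SPMD serving design described under "More LLM Infrastructure Support" above: the key property is that every worker runs the same program and derives its share of the work from its own rank, so no central driver process is needed. A toy simulation of that idea; the sharding scheme and all names here are made up for illustration and are not vLLM's or veRL's actual code:

```python
# In a single-controller design, one driver (the "LLMEngine") owns the
# schedule and pushes work to passive workers. In an SPMD design, each
# worker computes its own slice of the work from (rank, world_size).

def spmd_worker(rank, world_size, prompts):
    """Every rank executes this identical function; the data it touches
    is a pure function of its rank, which is what makes the system SPMD."""
    local = prompts[rank::world_size]  # static round-robin shard
    return [f"rank{rank} generated for: {p}" for p in local]

def run_spmd(world_size, prompts):
    # In practice these would be separate processes launched together
    # (e.g. via torchrun or Ray); a plain loop stands in for them here.
    outputs = []
    for rank in range(world_size):
        outputs.extend(spmd_worker(rank, world_size, prompts))
    return outputs
```

Because no worker waits on a central scheduler, the generation step composes cleanly with other SPMD components such as FSDP training workers.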