FSDP2 root level parameter management #1091

dingqingy · 2025-04-11T01:54:57Z

Hi,

I am curious about the design decision to manage both the token embeddings and the final output layer at the root FSDP level, instead of wrapping them separately the way the transformer blocks are wrapped.

This coupled management seems suboptimal: in forward, for example, the final output layer is unsharded too early and the token embedding is resharded too late.
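To make the timing concern concrete, here is a toy sketch in plain Python (no torch). It assumes a simplified model in which a parameter group is all-gathered when its first module runs and freed when its last module finishes; the module names and the helpers `forward_lifetimes`, `root_level_groups`, and `separate_groups` are hypothetical, and prefetching and any root-specific heuristics are ignored.

```python
def module_order(n_layers):
    """Forward execution order of a Llama-like model (illustrative names)."""
    blocks = [f"layers.{i}" for i in range(n_layers)]
    return ["tok_embeddings"] + blocks + ["norm", "output"]


def root_level_groups(n_layers):
    """Each block is its own group; everything else falls to the root group."""
    blocks = [[f"layers.{i}"] for i in range(n_layers)]
    return blocks + [["tok_embeddings", "norm", "output"]]


def separate_groups(n_layers):
    """Embeddings and [norm, output] are wrapped as their own groups."""
    blocks = [[f"layers.{i}"] for i in range(n_layers)]
    return [["tok_embeddings"]] + blocks + [["norm", "output"]]


def forward_lifetimes(groups, order):
    """Map module name -> (unshard_step, reshard_step) during forward.

    Simplifying assumption: a group is unsharded at the step of its first
    module and resharded at the step of its last module.
    """
    step_of = {name: i for i, name in enumerate(order)}
    lifetimes = {}
    for group in groups:
        steps = [step_of[name] for name in group]
        for name in group:
            lifetimes[name] = (min(steps), max(steps))
    return lifetimes


order = module_order(4)
root = forward_lifetimes(root_level_groups(4), order)
sep = forward_lifetimes(separate_groups(4), order)
for name in ("tok_embeddings", "output"):
    print(name, "root-level:", root[name], "separate:", sep[name])
```

Under these assumptions, the root-level grouping keeps both `tok_embeddings` and `output` unsharded for the entire forward pass, while separate wrapping limits each to the steps where it is actually used.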

Also for the optimization (see here) that disables reshard_after_forward for the last transformer block layer, would it be more appropriate to perform this optimization on the final linear layer instead of the last transformer block?

Thanks!


awgu · 2025-04-11T04:01:41Z

#382 is probably closer to ideal wrapping. I agree that separately wrapping embeddings and final output linear is more efficient. cc: @tianyu-l if he wants to change it.

tianyu-l · 2025-04-11T21:37:33Z

We can do that!
One question for @awgu: if we use reshard_after_forward=False for [norm, output], do we still need reshard_after_forward=False for the last transformer block?

awgu · 2025-04-11T21:45:36Z

@tianyu-l I think we can get rid of the reshard_after_forward=False for the last transformer block. I think it increases peak memory slightly, and I saw several places copy it from torchtitan 😓
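A rough sketch of the wrapping discussed above, written as a plain-Python plan so it runs without a distributed setup. The helper `build_wrap_plan` and the module names are hypothetical; the assumption is that each entry would map to one `fully_shard(modules, reshard_after_forward=flag)` call from `torch.distributed.fsdp`, and this is not necessarily what the linked PR does.

```python
def build_wrap_plan(n_layers):
    """Return [(module_names, reshard_after_forward), ...] in wrapping order.

    Assumption: every transformer block reshards after forward, and only the
    final [norm, output] group stays unsharded, since backward starts there.
    """
    plan = [(["tok_embeddings"], True)]
    plan += [([f"layers.{i}"], True) for i in range(n_layers)]
    # Last group: skip the reshard so backward does not all-gather it again.
    plan.append((["norm", "output"], False))
    return plan


for names, flag in build_wrap_plan(3):
    # In real code this would be roughly:
    #   fully_shard([getattr(model, n) for n in names], reshard_after_forward=flag)
    print(names, "reshard_after_forward =", flag)
```

With this plan, the last transformer block no longer needs a special case: the only group that avoids resharding is the one whose parameters are needed immediately at the start of backward.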


tianyu-l added the module: fsdp and question labels Apr 11, 2025

tianyu-l self-assigned this Apr 11, 2025

tianyu-l linked a pull request Apr 11, 2025 that will close this issue

improve reshard_after_forward logic #1094

Open

wwwjn mentioned this issue Apr 12, 2025

[Flux] Improve reshare_after_forward for Flux model's last layer #1097

Merged

wwwjn added a commit that referenced this issue Apr 13, 2025

[Flux] Improve reshare_after_forward for Flux model's last layer (#1097)

132b1ee

As title. Set reshard_after_forward=False for last layer to avoid gather right after reshard. Similar to llama as discussed in #1091.
