
Dynamo error with large mesh + AdamWFp8 + bf16 stochastic rounding #2074


Open
cassanof opened this issue Apr 18, 2025 · 1 comment
Labels: bug (Something isn't working), distributed optimizer

Comments

@cassanof

Hello, I am getting the following error whenever I scale training up to 512 GPUs while using FSDP2 + AdamWFp8 + BF16 stochastic rounding:

  torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_method copy_(*(DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253,
7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468,
476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),)), DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),))), **{}): got RuntimeError('expand: attempting to expand a dimension of length 16192!')

  from user code:
     File "/home/federico/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/adam.py", line 189, in single_param_adam
      p.copy_(_fp32_to_bf16_sr(p_f32))

  Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
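
For reference, here is a minimal sketch of the kind of setup that hits this. It is an assumption-based reconstruction, not the original training script: the model, learning rate, mesh dim name, and the use of torchao's `bf16_stochastic_round` flag are placeholders inferred from the traceback.

  # Hypothetical repro sketch (not the original script): FSDP2's fully_shard
  # over a flat 512-rank mesh, plus torchao's AdamWFp8 keeping BF16 weights
  # via stochastic rounding. Model size, lr, and dim name are placeholders.
  import torch
  from torch.distributed.device_mesh import init_device_mesh
  from torch.distributed.fsdp import fully_shard  # FSDP2 API
  from torchao.prototype.low_bit_optim import AdamWFp8

  # One flat data-parallel mesh spanning all 512 ranks.
  mesh = init_device_mesh("cuda", (512,), mesh_dim_names=("dp_shard_cp",))

  model = torch.nn.Linear(7168, 7168, dtype=torch.bfloat16, device="cuda")
  fully_shard(model, mesh=mesh)  # parameters become DTensors with Shard(0) placement

  # bf16_stochastic_round=True is what routes the compiled step through
  # _fp32_to_bf16_sr in single_param_adam, the call shown in the traceback.
  optim = AdamWFp8(model.parameters(), lr=3e-4, bf16_stochastic_round=True)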

Scaling the run down, or switching to HSDP, works around the problem, but neither is a great solution.
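
For context, the HSDP workaround amounts to handing fully_shard a 2-D mesh instead of the flat one, so that parameters are sharded over a much smaller group. A rough sketch; the 8x64 split and the dim names are illustrative choices, not taken from the actual run:

  # Hedged sketch of the HSDP workaround: replicate across dim 0 and shard
  # across dim 1, so no sharded dimension spans all 512 ranks.
  import torch
  from torch.distributed.device_mesh import init_device_mesh
  from torch.distributed.fsdp import fully_shard

  hsdp_mesh = init_device_mesh(
      "cuda", (8, 64), mesh_dim_names=("dp_replicate", "dp_shard")
  )

  model = torch.nn.Linear(7168, 7168, dtype=torch.bfloat16, device="cuda")
  # With a 2-D mesh, fully_shard applies HSDP: replication over the 0th mesh
  # dim and sharding over the 1st.
  fully_shard(model, mesh=hsdp_mesh)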

@janeyx99
Contributor

cc @weifengpy @gau-nernst

@supriyar added the distributed optimizer and bug (Something isn't working) labels on Apr 18, 2025