
Dynamo error with large mesh + AdamWFp8 + bf16 stochastic rounding #2074


Open
cassanof opened this issue Apr 18, 2025 · 1 comment
Labels: bug (Something isn't working), distributed optimizer

Comments

@cassanof

Hello, I am getting the following error whenever I scale training up to 512 GPUs while using FSDP2 + AdamWFp8 + BF16 stochastic rounding:

  torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_method copy_(*(DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253,
7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468,
476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),)), DTensor(local_tensor=FakeTensor(..., device='cuda:4', size=(253, 7168), dtype=torch.bfloat16), device_mesh=DeviceMesh('cuda', [4, 12, 20, 28, 36, 44, 52, 60, 68, 76, 84, 92, 100, 108, 116, 124, 132, 140, 148, 156, 164, 172, 180, 188, 196, 204, 212, 220, 228, 236, 244, 252, 260, 268, 276, 284, 292, 300, 308, 316, 324, 332, 340, 348, 356, 364, 372, 380, 388, 396, 404, 412, 420, 428, 436, 444, 452, 460, 468, 476, 484, 492, 500, 508], mesh_dim_names=('dp_shard_cp',)), placements=(Shard(dim=0),))), **{}): got RuntimeError('expand: attempting to expand a dimension of length 16192!')

  from user code:
     File "/home/federico/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/adam.py", line 189, in single_param_adam
      p.copy_(_fp32_to_bf16_sr(p_f32))

  Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
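
For reference, here is a minimal sketch of the kind of setup that hits this. It is an assumption-based reconstruction, not the original training script: the model, learning rate, mesh dim name, and the use of torchao's `bf16_stochastic_round` flag are placeholders inferred from the traceback.

  # Hypothetical repro sketch (not the original script): FSDP2's fully_shard
  # over a flat 512-rank mesh, plus torchao's AdamWFp8 keeping BF16 weights
  # via stochastic rounding. Model size, lr, and dim name are placeholders.
  import torch
  from torch.distributed.device_mesh import init_device_mesh
  from torch.distributed.fsdp import fully_shard  # FSDP2 API
  from torchao.prototype.low_bit_optim import AdamWFp8

  # One flat data-parallel mesh spanning all 512 ranks.
  mesh = init_device_mesh("cuda", (512,), mesh_dim_names=("dp_shard_cp",))

  model = torch.nn.Linear(7168, 7168, dtype=torch.bfloat16, device="cuda")
  fully_shard(model, mesh=mesh)  # parameters become DTensors with Shard(0) placement

  # bf16_stochastic_round=True is what routes the compiled step through
  # _fp32_to_bf16_sr in single_param_adam, the call shown in the traceback.
  optim = AdamWFp8(model.parameters(), lr=3e-4, bf16_stochastic_round=True)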

Scaling the run down, or switching to HSDP, works around the problem, but neither is a great solution.
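
For context, the HSDP workaround amounts to handing fully_shard a 2-D mesh instead of the flat one, so that parameters are sharded over a much smaller group. A rough sketch; the 8x64 split and the dim names are illustrative choices, not taken from the actual run:

  # Hedged sketch of the HSDP workaround: replicate across dim 0 and shard
  # across dim 1, so no sharded dimension spans all 512 ranks.
  import torch
  from torch.distributed.device_mesh import init_device_mesh
  from torch.distributed.fsdp import fully_shard

  hsdp_mesh = init_device_mesh(
      "cuda", (8, 64), mesh_dim_names=("dp_replicate", "dp_shard")
  )

  model = torch.nn.Linear(7168, 7168, dtype=torch.bfloat16, device="cuda")
  # With a 2-D mesh, fully_shard applies HSDP: replication over the 0th mesh
  # dim and sharding over the 1st.
  fully_shard(model, mesh=hsdp_mesh)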

@janeyx99
Contributor

cc @weifengpy @gau-nernst

@supriyar added the distributed optimizer and bug (Something isn't working) labels on Apr 18, 2025