Support DTensor params in local_sgd/diloco #168
Conversation
9a89a69 to 6ba0eee
LGTM -- it is unfortunate that we have to add special cases for DTensor for this kind of thing. Does it do a full `.to_local()` call if you don't do this?
Without the DTensor conditionals, operations like these:

```python
def _restore_parameters(self) -> None:
    with torch.no_grad():
        # TODO: consider running copy on a separate stream
        for name, p in self._model.named_parameters():
            p.data.copy_(self.original_parameters[name], non_blocking=False)
            if isinstance(p, DTensor):
```
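The quoted diff cuts off at the `isinstance` check. As a rough sketch only (not the PR's actual code), the DTensor branch presumably re-wraps the saved local shard with the parameter's mesh and placements before copying; the standalone `restore_parameters` function, its arguments, and the branch body below are assumptions:

```python
import torch
import torch.nn as nn
from torch.distributed.tensor import DTensor


def restore_parameters(model: nn.Module, original_parameters: dict[str, torch.Tensor]) -> None:
    """Sketch: copy saved backups (local shards for DTensor params) back into the model."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if isinstance(p, DTensor):
                # Assumed: the backup holds this rank's local shard, so wrap it back
                # into a DTensor matching the param's mesh/placements before copy_.
                restored = DTensor.from_local(
                    original_parameters[name],
                    device_mesh=p.device_mesh,
                    placements=p.placements,
                )
                p.data.copy_(restored, non_blocking=False)
            else:
                p.data.copy_(original_parameters[name], non_blocking=False)
```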
If `p` is a DTensor, does `p.copy_()`, instead of `p.data.copy_()`, work?
IIRC it didn't work and caused a segfault.
Tested in torchtitan (pytorch/torchtitan#1122): when the model is sharded with FSDP we need to convert the params to local tensors, update them, then convert back to DTensors (roughly the round-trip sketched below).
Ideally I would like to add a test in the integration tests, but that would require us to set up the device_mesh / FSDP for a model.
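For illustration only, here is a minimal sketch of the save side of that round-trip, assuming FSDP-sharded parameters are DTensors whose local shards are what get backed up and later averaged; the `save_parameters` function name and its exact shape are assumptions, not code from this PR or torchtitan:

```python
import torch
import torch.nn as nn
from torch.distributed.tensor import DTensor


def save_parameters(model: nn.Module) -> dict[str, torch.Tensor]:
    """Sketch: snapshot params as plain local tensors (one shard per rank for DTensors)."""
    backup: dict[str, torch.Tensor] = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if isinstance(p, DTensor):
                # FSDP-sharded params are DTensors; keep only this rank's shard so
                # later averaging and copies operate on regular local tensors.
                backup[name] = p.data.to_local().clone()
            else:
                backup[name] = p.data.clone()
    return backup
```

The matching restore would wrap each saved shard back into a DTensor with `DTensor.from_local`, as in the earlier sketch.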