torchelastic Rendezvous Backend #172

Open
d4l3k opened this issue Apr 25, 2025 · 0 comments
Labels
enhancement New feature or request

d4l3k (Member) commented Apr 25, 2025

We want to use torchft's fast Lighthouse quorum implementation to provide faster dynamic rendezvous for torchelastic.

Torchelastic has an entry-points-based mechanism for registering new backends: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
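
As a rough illustration of the registration pattern, here is a minimal, self-contained sketch. It does not import torch: `HandlerRegistry`, `FakeEntryPoint`, `register_discovered`, and `create_lighthouse_handler` are all hypothetical stand-ins, and the real registry's loader contract and entry-point group name should be taken from the linked `registry.py`, not from this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable

# A creator turns rendezvous parameters into a handler object.
HandlerCreator = Callable[[Dict[str, Any]], Any]


class HandlerRegistry:
    """Hypothetical stand-in for torchelastic's rendezvous handler registry."""

    def __init__(self) -> None:
        self._creators: Dict[str, HandlerCreator] = {}

    def register(self, backend: str, creator: HandlerCreator) -> None:
        if backend in self._creators:
            raise ValueError(f"backend {backend!r} is already registered")
        self._creators[backend] = creator

    def create_handler(self, backend: str, params: Dict[str, Any]) -> Any:
        if backend not in self._creators:
            raise ValueError(f"unknown rendezvous backend {backend!r}")
        return self._creators[backend](params)


@dataclass
class FakeEntryPoint:
    """Mimics the `name` / `load()` surface of importlib.metadata.EntryPoint."""

    name: str
    target: HandlerCreator

    def load(self) -> HandlerCreator:
        return self.target


def register_discovered(registry: HandlerRegistry, entry_points: Iterable[Any]) -> None:
    # Each discovered entry point contributes one backend name and one creator.
    for ep in entry_points:
        registry.register(ep.name, ep.load())


def create_lighthouse_handler(params: Dict[str, Any]) -> Dict[str, Any]:
    # Hypothetical creator; a real one would return a rendezvous handler object.
    return {"backend": "lighthouse", "endpoint": params["endpoint"]}


registry = HandlerRegistry()
register_discovered(registry, [FakeEntryPoint("lighthouse", create_lighthouse_handler)])
handler = registry.create_handler("lighthouse", {"endpoint": "node0:29510"})
```

In a real package the entry point would be declared in the packaging metadata and discovered at import time rather than constructed by hand as above.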

Key features we want to support:

  • flexible lighthouse config: support an external lighthouse, and support automatically starting a lighthouse from the address, similar to c10d's TCPStore
  • scale up / scale down operations
  • hot spares for fast restarts
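
For the first item, one possible shape of the "external vs. auto-started" decision is sketched below. This is only an assumption about how it could work: the node whose hostname matches the endpoint host starts the lighthouse locally unless the lighthouse is marked external. `LighthousePlan`, `parse_endpoint`, `plan_lighthouse`, the `external` flag, and the default port value are all hypothetical, and the parser handles plain `host:port` strings only (no IPv6 literals).

```python
import socket
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

DEFAULT_PORT = 29510  # placeholder default for this sketch, not a confirmed value


@dataclass(frozen=True)
class LighthousePlan:
    host: str
    port: int
    start_local_server: bool  # True if this node should launch the lighthouse


def parse_endpoint(endpoint: str, default_port: int = DEFAULT_PORT) -> Tuple[str, int]:
    """Split 'host[:port]' into (host, port), falling back to default_port."""
    host, sep, port_str = endpoint.rpartition(":")
    if not sep:
        return endpoint, default_port
    return host, int(port_str)


def plan_lighthouse(
    endpoint: str,
    external: bool = False,
    local_names: Optional[Iterable[str]] = None,
) -> LighthousePlan:
    """Decide whether this node connects to, or also starts, the lighthouse."""
    host, port = parse_endpoint(endpoint)
    if local_names is None:
        local_names = {"localhost", "127.0.0.1", socket.gethostname()}
    # Only the node named in the endpoint hosts the server, and only when
    # the lighthouse is not managed externally.
    start_local = (not external) and host in set(local_names)
    return LighthousePlan(host=host, port=port, start_local_server=start_local)
```

For example, `plan_lighthouse("node0:1234", local_names={"node0"})` asks the node to start the server, while passing `external=True` or calling it from a node with a different hostname yields a connect-only plan.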

References:

d4l3k added the enhancement label Apr 25, 2025