Having the ability to support multiple quorums on the same lighthouse server would make it much easier to deploy torchft in certain scenarios.
With this feature, you could deploy a single lighthouse server and share it across all jobs running in that cluster, using the job ID to tell them apart. This simplifies discovery and would work with most batch job schedulers.
Two possible designs:
1. room_id outside of gRPC
We likely want to make it so you can create a new Lighthouse client with a certain key, and it will automatically isolate that client's requests to the matching namespace.
Implementing this cleanly on the server side may be a bit tricky: we may need to do some manipulation under the hood to instantiate one LighthouseServer instance per incoming request, which would keep "room" IDs out of the API.
https://github.com/pytorch/torchft/blob/main/src/lighthouse.rs#L601
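A rough sketch of design 1, in plain Python rather than the real Rust/gRPC code. Every name here (`RoomState`, `RoomRouter`, `NamespacedClient`, the `room-id` key) is hypothetical and not part of torchft. It assumes the room ID travels next to the request, much like a gRPC metadata header would, and that the server lazily creates one state object per room:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoomState:
    """Hypothetical per-room state: replica_id -> last heartbeat time."""

    heartbeats: dict[str, float] = field(default_factory=dict)


class RoomRouter:
    """Hypothetical front end that keeps one RoomState per room ID."""

    def __init__(self) -> None:
        self._rooms: dict[str, RoomState] = {}

    def room(self, metadata: dict[str, str]) -> RoomState:
        # The room ID is read from side-channel metadata, so the request
        # payloads themselves never mention rooms.
        room_id = metadata.get("room-id", "default")
        if room_id not in self._rooms:
            self._rooms[room_id] = RoomState()
        return self._rooms[room_id]

    def heartbeat(self, metadata: dict[str, str], replica_id: str, now: float) -> None:
        self.room(metadata).heartbeats[replica_id] = now


class NamespacedClient:
    """Hypothetical client that is bound to one room when it is created."""

    def __init__(self, router: RoomRouter, room_id: str) -> None:
        self._router = router
        self._metadata = {"room-id": room_id}

    def heartbeat(self, replica_id: str, now: float) -> None:
        # Callers never pass a room ID; the client attaches it to every call.
        self._router.heartbeat(self._metadata, replica_id, now)
```

In this sketch, two clients created with different keys can reuse the same replica IDs without seeing each other's heartbeats, which is the isolation this design is after.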
2. room_id as field on heartbeat + quorum methods
This may be simpler to implement in some ways, as we can just add a room_id field to all the lighthouse requests and then route internally as necessary. No magic with gRPC services is required.
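A minimal sketch of design 2, again in plain Python with hypothetical names (`HeartbeatRequest`, `QuorumRequest`, `MultiRoomLighthouse`, and `min_replicas` are illustrative, not the actual torchft messages or quorum logic). It assumes each request carries a `room_id` field and the server keys its state by that field:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeartbeatRequest:
    replica_id: str
    room_id: str = "default"  # hypothetical new field; default keeps old callers working


@dataclass(frozen=True)
class QuorumRequest:
    replica_id: str
    room_id: str = "default"  # hypothetical new field


class MultiRoomLighthouse:
    """Hypothetical server that routes each request by its room_id field."""

    def __init__(self, min_replicas: int) -> None:
        self._min_replicas = min_replicas
        self._participants: dict[str, set[str]] = {}
        self._last_seen: dict[str, set[str]] = {}

    def heartbeat(self, req: HeartbeatRequest) -> None:
        # Liveness is tracked per room, so equal replica IDs in
        # different rooms do not collide.
        self._last_seen.setdefault(req.room_id, set()).add(req.replica_id)

    def quorum(self, req: QuorumRequest) -> Optional[list[str]]:
        # Simplified stand-in for quorum logic: a room's quorum forms once
        # enough distinct replicas in that room have asked for one.
        members = self._participants.setdefault(req.room_id, set())
        members.add(req.replica_id)
        if len(members) < self._min_replicas:
            return None
        return sorted(members)
```

The trade-off relative to the first design is visible in the request types: the room ID becomes part of every message, but routing is an ordinary dictionary lookup on a request field.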