Issues: pytorch/torchft
Issues list
LighthouseClient: support heartbeats
enhancement
New feature or request
good first issue
Good for newcomers
lighthouse
Lighthouse and quorum related
#174
opened Apr 25, 2025 by
d4l3k
support multiple quorums on the same LighthouseServer
enhancement
New feature or request
lighthouse
Lighthouse and quorum related
#173
opened Apr 25, 2025 by
d4l3k
torchelastic Rendezvous Backend
enhancement
New feature or request
#172
opened Apr 25, 2025 by
d4l3k
Towards Native Fault Tolerance for Semi-Synchronous Training
enhancement
New feature or request
#171
opened Apr 23, 2025 by
WarrenZhu050413
[lighthouse] fast failure on missing heartbeat instead of timeout
enhancement
New feature or request
lighthouse
Lighthouse and quorum related
#164
opened Apr 15, 2025 by
rualark
Explain quorum
documentation
Improvements or additions to documentation
#163
opened Apr 15, 2025 by
rualark
Example train_ddp.py run command runs only one process, but there are two groups and it is not explained how to use both
documentation
Improvements or additions to documentation
#162
opened Apr 15, 2025 by
rualark
README.md contains train.py, but it is not used and there is no explanation what that is
documentation
Improvements or additions to documentation
#161
opened Apr 15, 2025 by
rualark
Add watchdog to _TimeoutManager+ProcessGroupNCCL to guarantee fast aborts
enhancement
New feature or request
process_group
related to ProcessGroups and collectives
#152
opened Mar 26, 2025 by
d4l3k
2
Add profiling to Manager
enhancement
New feature or request
manager
process_group
related to ProcessGroups and collectives
#137
opened Mar 17, 2025 by
d4l3k
PGTransport in-place transfers
checkpoint
related to checkpointing/recovery/healing
process_group
related to ProcessGroups and collectives
python
#118
opened Feb 22, 2025 by
d4l3k
add profiling to ProcessGroupBaby
process_group
related to ProcessGroups and collectives
python
#116
opened Feb 22, 2025 by
d4l3k
Use bucketized model averaging for LocalSGD
enhancement
New feature or request
good first issue
Good for newcomers
#66
opened Jan 10, 2025 by
d4l3k
Dataloader question upon restart
data
Related to dataloading
enhancement
New feature or request
question
Further information is requested
#58
opened Jan 7, 2025 by
cjolivier01
[dataloader] dataloading improvement tracking issue
data
Related to dataloading
enhancement
New feature or request
#37
opened Dec 12, 2024 by
d4l3k
3 tasks
[CheckpointServer] use streaming transfers
enhancement
New feature or request
good first issue
Good for newcomers
#36
opened Dec 12, 2024 by
d4l3k