docs/source/multi_gpu.rst

.. note:: Huge batch sizes are actually really bad for convergence. Check out:
    `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_

PytorchElastic
--------------

Lightning supports the use of PytorchElastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of GPUs you want to use in the Trainer.

.. code-block:: python

    from pytorch_lightning import Trainer

    Trainer(gpus=8, distributed_backend='ddp')

Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/0.2.0/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts.
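
The snippet below is a sketch of that step, adapted from the linked quickstart rather than taken verbatim from this repository; treat ``PUBLIC_HOSTNAME`` (an address the other nodes can reach) and etcd's default client port ``2379`` as placeholders to adjust for your cluster:

.. code-block:: bash

    # Start a single-node etcd server with the v2 API enabled,
    # reachable by the elastic agents on every host in the job.
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379 \
         --advertise-client-urls http://PUBLIC_HOSTNAME:2379

With this server running, the elastic agents on each node can use it as the rendezvous backend described in the quickstart.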