
Commit 62679bb

tullieBorda authored and committed
Add ElasticTraining documentation (#1818)
(cherry picked from commit fddd618)
1 parent 694f1d7 commit 62679bb

File tree

1 file changed, +34 -0 lines changed


docs/source/multi_gpu.rst  +34
@@ -367,3 +367,37 @@ The reason is that the full batch is visible to all GPUs on the node when using
.. note:: Huge batch sizes are actually really bad for convergence. Check out:
    `Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour <https://arxiv.org/abs/1706.02677>`_

PytorchElastic
--------------
Lightning supports the use of PytorchElastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' or 'ddp2' backend and the number of GPUs you want to use in the Trainer.

.. code-block:: python

    Trainer(gpus=8, distributed_backend='ddp')
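
For context, here is a minimal sketch of the kind of self-contained training script (the ``YOUR_LIGHTNING_TRAINING_SCRIPT.py`` referenced in the launch command below) this setup assumes; the ``LitModel`` module and its random data are hypothetical, purely for illustration:

.. code-block:: python

    # minimal_train.py -- a hypothetical stand-in for the training script
    # launched by torchelastic below; the model and data are illustrative only.
    import torch
    from torch.nn import functional as F
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self.layer(x), y)
            return {'loss': loss}

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

        def train_dataloader(self):
            # random tensors keep the sketch self-contained
            x = torch.randn(256, 32)
            y = torch.randint(0, 2, (256,))
            return DataLoader(TensorDataset(x, y), batch_size=32)

    if __name__ == '__main__':
        # 'ddp' (or 'ddp2') is required for elastic launching;
        # gpus is the number of devices used on each node.
        trainer = pl.Trainer(gpus=8, distributed_backend='ddp')
        trainer.fit(LitModel())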

Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/0.2.0/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash

    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379
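
The etcd server only acts as the rendezvous store: workers register under the job's ``--rdzv_id`` to discover their peers, so the advertised client URL must be reachable from every node participating in the job.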

And then launch the elastic job with:

.. code-block:: bash

    python -m torchelastic.distributed.launch \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
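
As a concrete (purely hypothetical) example, a job allowed to shrink to 2 and grow to 4 nodes with 8 trainers each would pass ``--nnodes=2:4 --nproc_per_node=8``; ``--nproc_per_node`` should normally match the ``gpus`` value passed to the ``Trainer``.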

See the official `PytorchElastic documentation <https://pytorch.org/elastic/0.2.0/index.html>`_ for details
on installation and more use cases.
