@@ -200,7 +200,8 @@ Distributed modes
Lightning allows multiple ways of training

- Data Parallel (`distributed_backend='dp'`) (multiple GPUs, 1 machine)
- - DistributedDataParallel (`distributed_backend='ddp'`) (multiple GPUs across many machines).
+ - DistributedDataParallel (`distributed_backend='ddp'`) (multiple GPUs across many machines, script-based launch).
+ - DistributedDataParallel (`distributed_backend='ddp_spawn'`) (multiple GPUs across many machines, spawn-based launch).
- DistributedDataParallel 2 (`distributed_backend='ddp2'`) (dp in a machine, ddp across machines).
- Horovod (`distributed_backend='horovod'`) (multi-machine, multi-GPU, configured at runtime)
- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)
@@ -253,6 +254,26 @@ Distributed Data Parallel
    # train on 32 GPUs (4 nodes)
    trainer = Trainer(gpus=8, distributed_backend='ddp', num_nodes=4)

+ This Lightning implementation of ddp calls your script under the hood multiple times, with the correct environment
+ variables set. If your code does not support this (ie: jupyter notebook, colab, or a nested script without a root package),
+ use `dp` or `ddp_spawn`.
+
+ .. code-block:: bash
+
+     # example for 3 GPUs ddp
+     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --gpus 3 --etc
+     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --gpus 3 --etc
+     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --gpus 3 --etc
+
+ The reason we use ddp this way is that `ddp_spawn` has a few limitations (because of Python and PyTorch):
+
+ 1. Since `.spawn()` trains the model in subprocesses, the model on the main process does not get updated.
+ 2. Dataloader(num_workers=N), where N is large, bottlenecks training with `ddp_spawn`...
+    ie: it will be VERY slow or not work at all. This is a PyTorch limitation.
+ 3. It forces everything to be picklable.
+
+ However, if you don't mind these limitations, you can use `ddp_spawn`.
+
Distributed Data Parallel 2
^^^^^^^^^^^^^^^^^^^^^^^^^^^
In certain cases, it's advantageous to use all batches on the same machine instead of a subset.
@@ -275,6 +296,75 @@ In this case, we can use ddp2 which behaves like dp in a machine and ddp across
    # train on 32 GPUs (4 nodes)
    trainer = Trainer(gpus=8, distributed_backend='ddp2', num_nodes=4)

+ Distributed Data Parallel Spawn
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ `ddp_spawn` is exactly like `ddp` except that it uses `.spawn()` to start the training processes.
+
+ .. warning:: It is STRONGLY recommended to use `ddp` for speed and performance.
+
+ .. code-block:: python
+
+     mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
+
+ Here's how to call this:
+
+ .. code-block:: python
+
+     # train on 8 GPUs (same machine (ie: node))
+     trainer = Trainer(gpus=8, distributed_backend='ddp_spawn')
+
+ Use this method if your script does not support being called from the command line (ie: it is nested without a root
+ project module). However, we STRONGLY discourage this use because it has limitations (because of Python and PyTorch):
+
+ 1. The model you pass in will not update. Save a checkpoint and restore from it (see the sketch after this list).
+ 2. Set Dataloader(num_workers=0), or it will bottleneck training.
+
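+ For example, here is a minimal sketch of the checkpoint workaround for limitation 1. The `LitModel` class and the
+ checkpoint path are placeholders for your own module and run output:
+
+ .. code-block:: python
+
+     model = LitModel()
+
+     # the spawned subprocesses train the model and write checkpoints to disk,
+     # but `model` in this (main) process keeps its original weights
+     trainer = Trainer(gpus=2, distributed_backend='ddp_spawn')
+     trainer.fit(model)
+
+     # restore the trained weights from a checkpoint written during training
+     # (this path is illustrative; use the checkpoint your run actually produced)
+     model = LitModel.load_from_checkpoint('lightning_logs/version_0/checkpoints/epoch=9.ckpt')
+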
+ `ddp` is MUCH faster than `ddp_spawn`. We recommend you install a top-level module for your project using setup.py:
+
+ .. code-block:: python
+
+     # setup.py
+     #!/usr/bin/env python
+
+     from setuptools import setup, find_packages
+
+     setup(name='src',
+           version='0.0.1',
+           description='Describe Your Cool Project',
+           author='',
+           author_email='',
+           url='https://github.com/YourSeed',  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
+           install_requires=[
+               'pytorch-lightning'
+           ],
+           packages=find_packages()
+           )
+
+ Then set up your project like so:
+
+ .. code-block:: bash
+
+     /project
+         /src
+             some_file.py
+             /or_a_folder
+         setup.py
+
+ Then install it as a root-level package:
+
+ .. code-block:: bash
+
+     cd /project
+     pip install -e .
+
+ Now you can call your scripts anywhere:
+
+ .. code-block:: bash
+
+     cd /project/src
+     python some_file.py --distributed_backend 'ddp' --gpus 8
+
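+ For reference, one way `some_file.py` could wire those command line flags into the Trainer is sketched below.
+ `LitModel` and the `src.models` import are hypothetical; `Trainer.add_argparse_args` / `from_argparse_args` add and
+ consume flags such as `--gpus` and `--distributed_backend`:
+
+ .. code-block:: python
+
+     # src/some_file.py
+     from argparse import ArgumentParser
+
+     from pytorch_lightning import Trainer
+     from src.models import LitModel  # hypothetical module holding your LightningModule
+
+
+     def main():
+         parser = ArgumentParser()
+         parser = Trainer.add_argparse_args(parser)  # adds --gpus, --distributed_backend, ...
+         args = parser.parse_args()
+
+         model = LitModel()
+         trainer = Trainer.from_argparse_args(args)
+         trainer.fit(model)
+
+
+     if __name__ == '__main__':
+         main()
+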
Horovod
^^^^^^^
`Horovod <http://horovod.ai>`_ allows the same training script to be used for single-GPU,
@@ -516,3 +606,23 @@ And then launch the elastic job with:
See the official `PytorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.
+
+ Jupyter Notebooks
+ -----------------
+ Unfortunately, none of the `ddp_` variants are supported in Jupyter notebooks. Please use `dp` for multiple GPUs. This is a known
+ Jupyter issue. If you feel like taking a stab at adding this support, feel free to submit a PR!
+
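+ For example, a notebook cell only needs to request the `dp` backend (`LitModel` is a placeholder for your
+ LightningModule):
+
+ .. code-block:: python
+
+     # `dp` stays inside the notebook process, so no script re-launching is required
+     model = LitModel()
+     trainer = Trainer(gpus=2, distributed_backend='dp')
+     trainer.fit(model)
+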
+ Pickle Errors
+ --------------
+ Multi-GPU training sometimes requires your model to be pickled. If you run into an issue with pickling,
+ try the following to figure out where it breaks:
+
+ .. code-block:: python
+
+     import pickle
+
+     model = YourModel()
+     pickle.dumps(model)
+
+ However, if you use `ddp` the pickling requirement is not there and you should be fine. If you use `ddp_spawn` the
+ pickling requirement remains. This is a limitation of Python.
+
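+ A common culprit is an attribute that the standard `pickle` module cannot serialize, such as a lambda. Below is a
+ minimal sketch of the failure and one possible fix; the `activation_fn` attribute is made up for illustration:
+
+ .. code-block:: python
+
+     import pickle
+
+     model = YourModel()
+
+     # a lambda stored on the model makes the whole model unpicklable
+     model.activation_fn = lambda x: x * 2
+     # pickle.dumps(model)  # raises a PicklingError
+
+     # fix: use a module-level named function instead, which pickles by reference
+     def double(x):
+         return x * 2
+
+     model.activation_fn = double
+     pickle.dumps(model)  # succeeds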