Using ignite with Megatron-style model-parallel PyTorch modules #1709

❓ Questions/Help/Support

This is a somewhat general question, but I'd love a detailed response. When going beyond standard data-parallel training towards hybrid data+model-parallel training (as in Megatron-LM), which ignite abstractions should be used, and which avoided?

@vfdev-5

Comments
@g-karthik thanks for an interesting question! I haven't yet explored this kind of hybrid data+model-parallel training and would love to test it. @sdesrozis any thoughts?
Hi @vfdev-5, MONAI has a model-parallel tutorial: https://github.com/Project-MONAI/research-contributions/tree/master/lamp-automated-model-parallelism Thanks.
I haven't yet experimented with model-parallel training. I would be very pleased to explore this topic.
My first thoughts, if we just consider model parallelism on 2 GPUs:

We should first test this before trying hybrid data+model parallelism. @g-karthik could you explain how you would distribute your model and data in that case? Thanks in advance.
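For context, here is a minimal sketch of what model parallelism on 2 GPUs looks like in plain PyTorch (the module and sizes are hypothetical; this is just the standard pattern of placing submodules on different devices and moving activations between them):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model split across two GPUs: the first half of the
    network lives on cuda:0, the second half on cuda:1."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 1024).to("cuda:0")
        self.part2 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        # Activations must be moved between devices explicitly.
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024)
y = torch.randint(0, 10, (32,), device="cuda:1")  # labels on the output device
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # autograd handles the cross-device backward pass
optimizer.step()
```

In an ignite Engine this train step would look the same; only the device placement of inputs and labels changes.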
@sdesrozis take a look at https://www.deepspeed.ai/tutorials/pipeline/ and https://www.deepspeed.ai/tutorials/megatron/ and the example there. I think, in addition to what @sdesrozis said,
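For reference, the DeepSpeed pipeline tutorial linked above boils down to roughly the pattern below (a sketch, not ignite API; `args`, `num_steps`, and `train_iter` are placeholders, and the script is assumed to be launched with the `deepspeed` launcher so the process group is set up):

```python
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Express the model as a flat list of layers so DeepSpeed can partition
# it into pipeline stages across GPUs.
layers = [nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)]
net = PipelineModule(layers=layers, loss_fn=nn.CrossEntropyLoss(), num_stages=2)

# `args` holds parsed CLI arguments, including --deepspeed_config.
engine, _, _, _ = deepspeed.initialize(
    args=args, model=net, model_parameters=net.parameters()
)

for _ in range(num_steps):
    # train_batch pulls micro-batches from the iterator and runs the full
    # pipeline schedule (forward, backward, optimizer step).
    loss = engine.train_batch(data_iter=train_iter)
```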
@vfdev-5 That's exactly what I was thinking regarding the collective ops in metrics.
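To make that concern concrete: in a hybrid setup, a metric's reduction must run over the data-parallel group only, not the whole world, since every rank within a model-parallel group sees the same batches. A minimal sketch, assuming a hypothetical 4-process layout (2-way model parallel x 2-way data parallel):

```python
import torch
import torch.distributed as dist

# Assumes init_process_group was already called with world_size=4.
# Hypothetical layout: ranks {0, 1} and {2, 3} each hold one model replica
# (2-way model parallel), so the data-parallel groups (same model shard,
# different data) are {0, 2} and {1, 3}.
dp_groups = [dist.new_group([0, 2]), dist.new_group([1, 3])]
my_dp_group = dp_groups[dist.get_rank() % 2]

# Placeholder metric accumulator, e.g. a running loss sum computed locally.
local_loss_sum = 0.0
acc = torch.tensor([local_loss_sum], device="cuda")

# Reducing over the default (world) group would double-count, because all
# ranks of a model-parallel group saw the same batches; reduce over the
# data-parallel group instead.
dist.all_reduce(acc, op=dist.ReduceOp.SUM, group=my_dp_group)
```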
@g-karthik @sdesrozis I'm working on making ignite's distributed module aware of a particular data-parallel configuration. I'll soon push a draft PR with a new API and an example using DeepSpeed.
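For reference, the existing data-parallel helpers in ignite.distributed look like the sketch below; the draft PR would make this machinery aware of model-parallel groups as well (its exact new API is not shown here):

```python
import torch
import torch.nn as nn
import ignite.distributed as idist

def training(local_rank, config):
    # auto_model / auto_optim wrap the model and optimizer for the
    # detected data-parallel backend (DDP, Horovod, XLA, ...).
    model = idist.auto_model(nn.Linear(1024, 10))
    optimizer = idist.auto_optim(
        torch.optim.SGD(model.parameters(), lr=config["lr"])
    )
    # ... build an ignite Engine with a train step and run it here ...

# Spawns one process per GPU and sets up the process group.
with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(training, {"lr": 0.01})
```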