DeepSpeed support for ignite.distributed #2008
Comments
@Kashu7100 Thank you for this suggestion! I confirm that it would be very nice to support DeepSpeed. Currently we have a docker environment configured with MS DeepSpeed: https://github.com/pytorch/ignite/tree/master/docker/msdp Would you like to contribute on this? It seems you already know how to do it 😉
@sdesrozis Do you think it is possible to reuse …?
It depends on what you want to do. The feature list of msdp is quite long, and the impacts are more or less deep. For instance, I think that pipeline parallelism would be a very nice feature to have, but it is not trivial to adapt. Maybe a first step could be distributed data parallelism using the simplified API, as you mentioned. Thus, it may be a new backend to develop and integrate. You can have a look here. Btw, it's not an easy task and maybe I'm wrong about what to do. @vfdev-5 was looking further into this; maybe he could help in the discussion.
@Kashu7100 Finally, introducing a new backend does not seem to be the right option. Have a look here, and you will see that native PyTorch distributed is used when the distributed environment variables are set. That is good news for simple use cases.
I would say yes.
@Kashu7100 thanks for the feature request! Yes, we plan to improve our support of the DeepSpeed framework. Our idea was to provide basic integration examples of how to use ignite and deepspeed together. I looked at it multiple times, and due to a certain overlap between the frameworks it was not obvious where to put the split. @sdesrozis I'm not sure whether we should add it as a new backend or not. Let's first create a basic integration example and see which parts of the DeepSpeed code could be simplified.
I think this could be integrated in our native backend, besides slurm.
IMO it is not necessary.
That is a good option. As discussed a few weeks ago, the specific engine should be the tricky part. Otherwise, the auto helpers could do the job, I suppose.
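The "native backend, not a new key" idea above can be sketched as a small dispatch. This is an illustration only, not ignite's actual internals (which live in `ignite.distributed.comp_models`); the function name is hypothetical:

```python
# Illustrative sketch: because DeepSpeed sets up a regular
# torch.distributed process group when the usual env vars
# (RANK, WORLD_SIZE, MASTER_ADDR, ...) are present, DeepSpeed-launched
# jobs can ride on the existing native backends instead of a new key.
NATIVE_BACKENDS = ("nccl", "gloo", "mpi")
OTHER_BACKENDS = ("xla-tpu", "horovod")

def resolve_comp_model(backend: str) -> str:
    """Return which computation model an idist-like helper would pick."""
    if backend in NATIVE_BACKENDS:
        # DeepSpeed-launched jobs land here too.
        return "native"
    if backend in OTHER_BACKENDS:
        return backend
    raise ValueError(f"unknown backend: {backend!r}")
```

Under this view, no "deepspeed" key is needed at all, which matches the comment that a new backend "is not necessary".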
Hi, is there any update on this?
@saifullah3396 well, this feature is not really a priority right now. If you would like to help with it, we can guide your development from the ignite side.
🚀 Feature
PyTorch Lightning recently added native support for MS DeepSpeed.
I believe it would also be helpful for users if ignite incorporated the DeepSpeed pipeline for memory-efficient distributed training.
1. for idist.auto_model?
To initialize the DeepSpeed engine:
And for the distributed environment setup, we need to replace
torch.distributed.init_process_group(...)
with deepspeed.init_distributed().
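The substitution described in point 1 could be captured in a tiny selector. A minimal sketch, assuming DeepSpeed is an optional dependency; the helper name is hypothetical, while deepspeed.init_distributed and torch.distributed.init_process_group are the real entry points (both read MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment):

```python
def pick_init_fn(deepspeed_available: bool) -> str:
    """Return the qualified name of the process-group initializer to call.

    deepspeed.init_distributed additionally detects MPI-style launchers and
    then sets up a torch.distributed process group itself, which is why the
    swap is mostly transparent for simple use cases.
    """
    if deepspeed_available:
        return "deepspeed.init_distributed"
    return "torch.distributed.init_process_group"
```

An idist-style setup helper would then call the chosen function once per process, before building models and optimizers.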
2. checkpoint handler
Checkpointing works slightly differently under DeepSpeed, so the handler needs adapting.
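The difference in point 2 is structural: plain torch checkpointing has rank 0 write a single file via torch.save, while DeepSpeed's engine.save_checkpoint(save_dir, tag) must be called on every rank and writes a directory of per-rank shards under save_dir/tag/. A small sketch of the resulting target layout (the function name is hypothetical, for illustration only):

```python
import os

def checkpoint_target(save_dir: str, tag: str, use_deepspeed: bool) -> str:
    """Where a saved checkpoint ends up in each scheme.

    - torch.save: one file, written by rank 0 only.
    - DeepSpeed engine.save_checkpoint: a sharded directory, written
      collectively by all ranks, so a handler built around a single
      torch.save file needs adapting.
    """
    if use_deepspeed:
        return os.path.join(save_dir, tag)        # sharded directory
    return os.path.join(save_dir, tag + ".pt")    # single file
```

This is why an ignite Checkpoint handler cannot simply swap the save call: it must also change what it tracks (a directory per tag rather than one file) and ensure every rank participates in the save.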