ValueError: bad value(s) in fds_to_keep when attempting DDP #538
Comments
I suspect you're right, it probably is your model. If that doesn't work, I have a change that allows you to bypass mp.spawn and use |
@jeffling should we add a hook for the mp.spawn call so ppl can do whatever they want? |
@williamFalcon Absolutely, I think that would be the right thing to do. This is the relevant part of the patched fit function version that we are running, in case it helps anyone else reading this.
We could have a callback that allows people to override everything under the mp.spawn call. I also think that using a process per GPU has some real benefits, and it would be good for the framework to support it out of the box, so we can allow for callback override and also support this use case in the default callback :) |
Just in case people want to use the snippet before we support this officially: you need to set Trainer.gpus to your world_size and Trainer.distributed_backend to the appropriate backend. In your module, you need the following overrides as well:
|
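The overrides themselves are not reproduced above. As a rough, hypothetical illustration of the process-per-GPU pattern being described (not the actual snippet from this thread; every name below is an assumption), the manual DDP setup in each externally launched process might look like this:

# Hypothetical sketch only: one process per GPU, launched externally, with
# RANK / WORLD_SIZE / LOCAL_RANK / MASTER_ADDR / MASTER_PORT set by the launcher.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def setup_ddp(model):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    # Join the process group; MASTER_ADDR/MASTER_PORT come from the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Wrap the model so gradients are synchronized across processes.
    return DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])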
@jeffling any tips on bypassing mp.spawn and using PyTorch's |
@armancohan You can use the code snippets and instructions in my two comments above, and then just use the |
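The launcher being referred to is cut off above; assuming it is the standard distributed launcher that ships with PyTorch, the script side might look roughly like this (script and argument names are placeholders):

# Hypothetical worker script, started e.g. with:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
# The launcher sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT and passes --local_rank.
import argparse

import torch.distributed as dist

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")  # reads rank/world size from the env
    print(f"worker {dist.get_rank()}/{dist.get_world_size()} on GPU {args.local_rank}")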
Thanks @jeffling, this was very helpful. |
Just to understand, @jeffling - currently pytorch lightning does NOT work with 'ddp' out of the box. That's because the framework doesn't propagate the seed value set in the initial process to the spawned processes (therefore, each process in the distributed training may end up with a different seed). |
@kwanUm You can set the seed value inside your lightning module as well. There are a few things that lightning can't or doesn't currently do, but a lot of these are dependent on your model and data:
|
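As an illustration of setting the seed inside the module, a minimal sketch (the class and hyperparameter name are assumptions, not from this thread) could look like this, so every spawned process seeds itself identically:

# Minimal sketch: each DDP process runs this constructor, so each one
# sets the same seed independently.
import random

import numpy as np
import torch

def seed_all(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

class MyLightningModule(torch.nn.Module):  # stands in for pl.LightningModule
    def __init__(self, hparams):
        super().__init__()
        seed_all(getattr(hparams, "seed", 1234))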
I have the same issue and can't make it work. The proposed solution by @jeffling has no effect on my machine. Any ideas would be highly appreciated. |
Same here. I can't figure out the reason. It would be much appreciated if someone could explain what the error means. |
@colanim @kyoungrok0517 could you give details about your machines and environments? |
@Borda Here's my environment
|
@jeffling could you check? |
I'm experiencing the same problem in any multi-gpu environment, not only with two Titan Vs. |
@kyoungrok0517 @colanim Just as a sanity check, do you have the same issues when running any model with lightning? Also, exactly which arguments are you passing in? |
No. The model was running well on multi-GPU at an early stage, but at some point it started failing. Here is my data loading code:

# imports for the snippets below
import json
from collections import Counter
from pathlib import Path

import pandas as pd
import zarr
from torch.utils.data import Dataset
from torchtext.vocab import Vocab

# data loading (pandas, parquet)
class News20Dataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = pd.read_parquet(data_path)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text_ids = self.tokenizer.encode(row.text)
        return {"text": text_ids, "target": row.label}

# data loading (numpy, zarr)
class TripleDataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = zarr.open(data_path, "r")
        self.tokenizer = tokenizer

# vector loading (torchtext.vocab); this is a method of the module, root_dir is defined elsewhere
def _get_bow_vocab(self):
    VOCAB_PATH = Path(root_dir) / "../../vocab/vocab.json"
    VECTORS = "fasttext.en.300d"
    MIN_FREQ = 10
    MAX_SIZE = 100000
    with open(VOCAB_PATH, "r", encoding="utf-8") as f:
        vocab_counts = Counter(json.load(f))
    return Vocab(
        vocab_counts,
        vectors=VECTORS,
        min_freq=MIN_FREQ,
        max_size=MAX_SIZE,
    ) |
Here are my dataloaders:

# DataLoader comes from torch.utils.data, cpu_count from multiprocessing
def _get_dataloader(self, dataset, test=False):
    batch_size = self.hparams.batch_size if not test else 10000
    num_workers = int(cpu_count() / 4) or 1
    return DataLoader(
        dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=True,
    )

def train_dataloader(self):
    return self._get_dataloader(self._train_dataset)

def val_dataloader(self):
    return self._get_dataloader(self._val_dataset)

def test_dataloader(self):
    return self._get_dataloader(self._test_dataset, test=True) |
This is my Trainer:

if hparams.profile:
    profiler = AdvancedProfiler()
else:
    profiler = None

# train
# trainer = Trainer.from_argparse_args(hparams)
trainer = Trainer(
    logger=tt_logger,
    default_save_path=root_dir,
    max_nb_epochs=hparams.max_nb_epochs,
    gpus=hparams.gpus,
    distributed_backend=hparams.distributed_backend,
    fast_dev_run=hparams.fast_dev_run,
    amp_level=hparams.amp_level,
    precision=hparams.precision,
    early_stop_callback=early_stop_callback,
    benchmark=True,
    profiler=profiler,
)
trainer.fit(model) |
FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn't seem likely it's lightning. Will add more if I learn more. Update: I do not believe this is pytorch-lightning. I have recently minted models that are virtually identical and do NOT show this problem. Not clear what is causing it ... almost certainly file related, as the specific error is a multiprocessing/posix complaint about file descriptors that do not have appropriate values. |
Well, maybe or maybe not. Then the problem could be a CUDA or apex compatibility issue. To add: I see the problem only when I use DDP. Using DP shows no problem.
|
I have resolved this in my environment. Writing it here in case it helps. During the construction of the model, and before creating the trainer etc., I store the .data property of a Parameter:
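The offending line is not shown above; a hypothetical reconstruction of the pattern being described (attribute names are guesses based on the self.class_p_t mentioned below, not the author's actual code) would be something like:

# Hypothetical sketch of the problematic pattern, not the author's exact code.
import torch
import torch.nn as nn

class MyModel(nn.Module):  # in the thread this is a LightningModule
    def __init__(self, num_classes: int):
        super().__init__()
        # learnable per-class weights for curriculum learning
        self.class_p = nn.Parameter(torch.ones(num_classes))
        # The problematic line: stashing the .data tensor keeps a second
        # reference to the parameter's storage around during DDP spawning.
        self.class_p_t = self.class_p.data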
I do NOT actually do anything with self.class_p_t - I assigned it in anticipation of logging these values to monitor the progress of self-learned curriculum-learning class weights. Speculating there is some issue relating to autograd - but just a guess. It generated ~100 file descriptors, which is what appeared to choke multiprocessing. For the record: Ubuntu 19.10, Python 3.7.5, PyTorch 1.4, venv. |
Hmm... are you suggesting it's a bug in pytorch? Is it expected behavior to generate many file descriptors if we extract the `.data` property from tensors? For temporary storage? I don't understand...
|
Sorry - I do not know if it's a bug in pytorch or not. I shared my results in case it helps. I am not sure under what circumstances the .data field is meant to be accessed - if ever. I don't even know if it's just revealing a bug in the latest Ubuntu release. I just know that removing that line made the problem go away. As I tried to say, I was merely speculating that this was due to some behind-the-scenes memory sharing approach in python/pytorch multiprocessing, for which memory files could be a useful mechanism for sharing large tensors. Again, this is ONLY speculation. I did post a question on the pytorch github. |
The people at pytorch did look at a fragment that reproduces this problem - and they had some insight into what is happening and why. I need to look at some lightning internals to make sure their diagnosis applies to what we are seeing. |
I have verified what causes the problem in my model and what will fix it. The problem is my naively assigning a parameter to another variable. This new reference to the parameter does not get moved to the correct GPU when pytorch-lightning copies the model to each process. This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:
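The snippet itself is missing above. A hypothetical illustration of the cause, together with the copy.deepcopy workaround that comes up later in the thread (attribute names are assumptions), might look like:

# Hypothetical illustration; names are placeholders, not the author's code.
import copy

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.class_p = nn.Parameter(torch.ones(num_classes))

        # Naive assignment: a second reference to the same underlying tensor.
        # self.class_p_t = self.class_p.data

        # Workaround discussed below: keep an independent copy instead, so no
        # shared storage is carried along when DDP processes are spawned.
        self.class_p_t = copy.deepcopy(self.class_p.data)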
Hope this helps ... |
thanks! |
@williamFalcon mrshenli's comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay here. Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, to use the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model back from the process on gpus[0], the model's parameters may be automatically resolved back to cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week ... |
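An untested sketch of that queue idea (the actual DDP training is elided and all names are placeholders): rank 0 puts its trained weights on a torch.multiprocessing queue, and the spawning process reads them back.

# Untested sketch of returning the trained model to the parent via a queue.
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, world_size, result_queue):
    # ... normal DDP setup and training would happen here ...
    model = nn.Linear(4, 2)  # stand-in for the trained model
    if rank == 0:
        # Move to CPU before handing back so the parent needs no GPU context.
        result_queue.put(model.cpu().state_dict())

if __name__ == "__main__":
    world_size = 2
    ctx = mp.get_context("spawn")
    result_queue = ctx.SimpleQueue()
    spawn_ctx = mp.spawn(train_worker, args=(world_size, result_queue),
                         nprocs=world_size, join=False)
    trained_state = result_queue.get()  # blocks until rank 0 puts its result
    while not spawn_ctx.join():
        pass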
Not sure if I'm late to the game, but this might be a hint to the origin of the problem. |
@LukasHedegaard pls update to 0.9 :] |
Unfortunately, I get the same results after updating to pytorch-lightning 0.9 |
@williamFalcon @sneiman the copy.deepcopy solution works, but it uses a lot of extra memory, as also noted in the documentation: https://docs.python.org/3/library/copy.html |
For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy(): I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue. |
Changing the strategy worked for me. Here's the code from the docs for anyone struggling with this like me.

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4) |
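If you prefer passing the strategy object explicitly, as in the previous comment, a sketch for a recent pytorch-lightning version (assuming DDPStrategy is available under pytorch_lightning.strategies) would be:

# Explicit strategy object instead of the spawn-based default.
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),  # optional kwarg, avoids extra overhead
    accelerator="gpu",
    devices=4,
)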
I can't get DDP working without getting the following error:
What I have tried that didn't work:
The error occurs on two servers I tried on, one with 4 Titan X cards and one with 8 Tesla V100 running Ubuntu 18.04.3 LTS.
I suspect that something in my model is triggering it and would appreciate ideas. I cannot share the source code, though. The model works in dp and single-GPU mode.