Extend docs with multiple dataloaders with common cases #1089
Comments
Hi! Thanks for your contribution, great first issue!
Good point. Having support for multiple training dataloaders would also be great. Mind sending a PR?
I'm interested in this task, but I have some questions.
1- Do we assume the data loaders are of the same length? What should we do if one runs out of data?
3- Would a more sensible design be:

def training_step(self, batch, batch_idx: int, dataloader_idx: int):
    if dataloader_idx == 0:
        # supervised loss, for example
        ...
    elif dataloader_idx == 1:
        # unsupervised loss
        ...
Thanks for all the replies. To @Dref360:
I found a related discussion here. The first reply provided a solution for multiple datasets, so I modified the provided code to be more flexible, as follows:

import random
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, datasets):
        self.datasets = datasets
        self.map_indexes = [[] for _ in self.datasets]
        self.min_length = min(len(d) for d in self.datasets)
        self.max_length = max(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[m[i]] for d, m in zip(self.datasets, self.map_indexes))

    def construct_map_index(self):
        def update_indices(original_indexes, target_len, max_len):
            # map max_len to target_len (large to small)
            # return: a list that maps range(max_len) to valid indices in the dataset
            original_indexes = original_indexes[max_len:]  # remove used indices
            fill_num = max_len - len(original_indexes)
            batch = fill_num // target_len
            if fill_num % target_len != 0:
                # so that fill_num + len(original_indexes) exceeds max_len
                batch += 1
            additional_indexes = list(range(target_len)) * batch
            random.shuffle(additional_indexes)
            original_indexes += additional_indexes
            assert len(original_indexes) >= max_len, "the length of matching indexes is too small"
            return original_indexes

        self.map_indexes = [update_indices(m, len(d), self.max_length)
                            for m, d in zip(self.map_indexes, self.datasets)]

    def __len__(self):
        # will be called every epoch
        self.construct_map_index()
        return self.max_length

In this case, __len__ is called every epoch, so the mapping indexes are re-generated each time. Construct one and use it as follows:

import torch
from torch.utils.data import DataLoader, TensorDataset
dataset_1 = TensorDataset(torch.arange(2))
dataset_2 = TensorDataset(torch.arange(3, 8))
dataset = CustomDataset([dataset_1, dataset_2])
dataloader = DataLoader(dataset, batch_size=3, shuffle=True)
for epoch in range(3):
    for batch in dataloader:
        print(batch)

Outputs:
The primary deficiency of this code is that the batch sizes of the datasets will be the same, and it might be a little hard for users to read. I hope this is helpful for developing the feature!
@williamFalcon @tullie pls ^^
Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what comes out of them at each epoch (including which node they go to) and the batch size.
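For instance, a wrapper dataset along these lines gives the labeled and unlabeled streams different effective batch sizes (a rough sketch only; RatioDataset, the (x, y) sample layout, and the ratio k are assumptions for illustration, not part of Lightning):

import random
import torch
from torch.utils.data import Dataset

class RatioDataset(Dataset):
    """Hypothetical wrapper: each item is one labeled sample plus k unlabeled samples,
    so a DataLoader with batch_size=B effectively yields B labeled and B*k unlabeled samples."""

    def __init__(self, labeled, unlabeled, k=4):
        self.labeled = labeled
        self.unlabeled = unlabeled
        self.k = k

    def __len__(self):
        # one "epoch" is defined by the labeled dataset
        return len(self.labeled)

    def __getitem__(self, i):
        x, y = self.labeled[i]
        # draw k unlabeled samples at random each time the item is fetched
        idx = [random.randrange(len(self.unlabeled)) for _ in range(self.k)]
        u = torch.stack([self.unlabeled[j][0] for j in idx])  # shape: (k, ...)
        return x, y, u

With the default collate function, each batch then comes out as (B, ...) labeled inputs, (B,) targets, and a (B, k, ...) block of unlabeled inputs.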
Why don't we make the output of this a common use case page? Add a new page for multiple dataloaders.
Then why do we have multiple dataloaders for test and validation? I'm feeling a bit puzzled...
I totally agree with the idea of a new doc for data loaders.
Is it because we would like to extract data from multiple datasets simultaneously in the training phase, while we usually loop over datasets sequentially in the validation/testing phase (like an evaluation step)?
Exactly. I could be wrong, but in training we usually want to use both batches at once; in val/test we use them sequentially.
In semi-supervised learning, domain adaptation, consistency training, etc., it is typical to use samples from different loaders in the same training step to compute various cross-losses. Thus, an alternating training step does not bring much usability improvement.
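For instance, such a cross-loss training step might look roughly like the sketch below (illustrative only; the combined-batch layout, self.augment, and self.consistency_weight are assumptions, not existing Lightning API):

import torch
import torch.nn.functional as F

# inside a LightningModule; assumes the combined loader yields both batches at once
def training_step(self, batch, batch_idx):
    (x_labeled, y), (x_unlabeled, _) = batch

    # supervised loss on the labeled batch
    supervised_loss = F.cross_entropy(self(x_labeled), y)

    # consistency loss between predictions on the unlabeled batch
    # and predictions on an augmented view of it (self.augment is a placeholder)
    with torch.no_grad():
        targets = torch.softmax(self(x_unlabeled), dim=-1)
    preds = torch.log_softmax(self(self.augment(x_unlabeled)), dim=-1)
    consistency_loss = F.kl_div(preds, targets, reduction="batchmean")

    return supervised_loss + self.consistency_weight * consistency_loss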
Maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether they should be sequential or simultaneous. If simultaneous, Lightning auto-loops or truncates to the shorter length?
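As a rough sketch of what that argument could control (plain Python, not an existing Lightning option; combined_batches and the mode names are made up here):

import itertools

def combined_batches(loader_a, loader_b, mode="truncate"):
    """Yield pairs of batches from two dataloaders.

    mode="truncate": stop when the shorter loader is exhausted (plain zip).
    mode="loop":     cycle the shorter loader until the longer one is exhausted.
    Note: itertools.cycle replays the batches cached from its first pass,
    so the cycled loader is not reshuffled between passes.
    """
    if mode == "truncate":
        yield from zip(loader_a, loader_b)
    elif mode == "loop":
        if len(loader_a) >= len(loader_b):
            yield from zip(loader_a, itertools.cycle(loader_b))
        else:
            yield from zip(itertools.cycle(loader_a), loader_b)
    else:
        raise ValueError(f"unknown mode: {mode}")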
A quick fix to get different batch sizes for the labeled and unlabeled dataloaders during training might be:

def prepare_data(self):
    ...
    self.train_unlabeled_dataloader = torch.utils.data.DataLoader(train_unlabeled_dataset, ...)
    self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
    ...

def training_step(self, batch, batch_idx):
    inputs_x, targets = batch
    try:
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    except StopIteration:
        # the unlabeled loader ran out; restart it
        self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    unlabeled_x = unlabeled_x.type_as(inputs_x)
    ...

But as @soupault said, it will be much more convenient to have multiple train dataloaders.
In our active learning library baal, we are currently trying to come up with a solution to the same problem. In our case, one of the DataLoaders will be massively larger than the other. As a consequence, we added some optional features:

Those two features are optional; if they are not provided, we simply alternate between the two loaders. We provide an implementation in this gist: https://gist.github.com/Dref360/2524e524244569ed47428f19c487f264 I would appreciate your feedback! Thank you!
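For reference, the plain alternating behaviour described above could be sketched roughly like this (an illustration, not the gist's actual implementation; alternate_loaders is a made-up name):

def alternate_loaders(loader_a, loader_b):
    """Yield (dataloader_idx, batch), alternating between two loaders.

    When one loader runs out, the remaining one keeps yielding on its own,
    so every batch from both loaders is seen exactly once per epoch.
    """
    iterators = [iter(loader_a), iter(loader_b)]
    exhausted = [False, False]
    while not all(exhausted):
        for idx, it in enumerate(iterators):
            if exhausted[idx]:
                continue
            try:
                yield idx, next(it)
            except StopIteration:
                exhausted[idx] = True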
I see that #1416 has been merged. Should we close this as well? If we want to make this a new feature, I think we have 3 cases to support.
Could we propose those three cases as Iterators and let the user pick one?
Or we could add an argument:
I would be happy to work on this as soon as we reach a decision :)
This was added here. Closing.
Hi, sorry to re-open this, but I'm facing this precise problem currently. I'd like to sample continuously from multiple dataloaders, not have batches contain
In fact, ideally this would work for |
I notice that one can evaluate the model on a list of validation/test data loaders. Is it also possible to extract data from multiple train_data_loaders in the training step in the current version? This feature might be useful in tasks like transfer learning or semi-supervised learning, which usually maintain multiple datasets in the training stage (e.g., source and target datasets in transfer learning, labeled and unlabeled datasets in semi-supervised learning). It would be nice if one could obtain a list of batch data as follows:
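For example (a hypothetical illustration only, assuming the batch arrives as a list with one entry per train dataloader):

def training_step(self, batch, batch_idx):
    # hypothetical: `batch` is a list with one batch per train dataloader,
    # e.g. [(x_source, y_source), (x_target, y_target)]
    (x_source, y_source), (x_target, _) = batch
    ...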