[Discussion] There should be a Metrics package #973
Comments
Love it!
So I like the API and agree this would be nice to factor out. However, I don't think that Lightning should supply the metrics. Outside of classification, this is a really tricky thing to get right and often requires lots of domain engineering. Just as with the dataloaders, you don't want to own them, but facilitate a super clean API. Random thought: why even specify CrossEntropy in training? Isn't that just another metric?
Hi, I may contribute WER because I am interested in using it. However, I have a few general questions based on the discussion above. My needs vary:
My question:
I agree with the extensibility principle, but I'd also like each community to have their own standardized metrics to help with reproducibility in research (maybe Lightning supplies the classification ones as an example, and then maybe the image community builds the ones for the tasks they care about, e.g. segmentation, pose estimation, etc. Same for the speech community, which may build WER and so on). I think it's fine if they live in submodules, or even externally, as long as the discoverability problem is tackled. I like submodules more because they are more "official" and this pushes people to collaborate on one reference implementation rather than have too many.
Fair enough :) Yep, it's definitely another metric.
Sure, I think we agree. A dataloader package like torchtext (https://github.com/pytorch/text) is not a submodule, right? It's an external package.
I am against this for exactly the same reason! There are a million implementations of BLEU in NLP exactly because everyone wants to standardize it to "their" version.
Ignite has convinced me that the computation part (DDP included) could be done at the Torch layer (more info in this comment there). Even if people want to reimplement their own BLEU, we'd probably want to land a reference implementation in TorchText. You are free to avoid it if you don't like it, but it's also a pain to write your own, so I think you'd do it only if you either don't trust its code, or it's not flexible enough for you, or it's not fast enough/uses too much memory. I'm willing to bet reimplementations will stop if something better is there. Isn't the whole point of Lightning to stop people from writing their own training loops? :D I see it as very much the same thing. Assuming the PyTorch API in the proposal happens, then maybe there is no API change that Lightning would have to undergo: just write your own

```python
import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl


class CoolSystem(pl.LightningModule):
    def __init__(self):
        super(CoolSystem, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)
        # nn.F1Score is the metric API proposed above (hypothetical, not in torch.nn today)
        self.train_macro_f1 = nn.F1Score("macro")
        self.validation_macro_f1 = nn.F1Score("macro")

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        self.train_macro_f1.update(y_hat, y)  # void method, just updates state
        m = self.train_macro_f1()  # forward calls compute(), takes no args, computes from state
        tensorboard_logs = {'train_loss': loss, 'train_macro_f1': m}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        self.validation_macro_f1.update(y_hat, y)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss, 'val_macro_f1': self.validation_macro_f1()}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)
```

There is some boilerplate, but it's not the end of the world.
I would not add another object which would hold metric values; it seems to duplicate logger functionality... What about defining just a list of metrics to be computed (a function or callable class), and these metrics would be automatically added to the output dictionary?
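A minimal sketch of that idea, assuming a hypothetical setup where plain callables are registered as metrics and a trainer-side hook merges their values into the logged output dict (none of these names are existing Lightning APIs):

```python
import torch
from sklearn.metrics import f1_score


# plain callables (functions or callable classes) registered as metrics
def accuracy(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return (y_hat.argmax(dim=1) == y).float().mean()


def macro_f1(y_hat: torch.Tensor, y: torch.Tensor) -> float:
    return f1_score(y.cpu().numpy(), y_hat.argmax(dim=1).cpu().numpy(), average="macro")


metrics = {"acc": accuracy, "macro_f1": macro_f1}


# hypothetical trainer-side hook: evaluate every registered metric on the step output
# and merge the values into the dict that goes to the logger
def add_metrics_to_output(output: dict, y_hat: torch.Tensor, y: torch.Tensor) -> dict:
    output.setdefault("log", {}).update({name: fn(y_hat, y) for name, fn in metrics.items()})
    return output
```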
I like the way Ignite's metrics expose their API, as previously discussed.
Opinion:
Motivation for Solution:
Solution:
EvaluationPlan (pseudo) implementation
@oplatek do you want to do it in a form like a callback?
@Borda Yes. The intended use is like this: (I renamed LightningEvaluation -> EvaluationPlan)
We should provide the default EvaluationPlan, which accepts any number of metrics and averages each of them:
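A rough sketch of what such a default plan could look like (the class and hook names here are guesses for illustration, not the implementation discussed above): it records each metric per batch and reports the per-epoch average.

```python
from collections import defaultdict

import torch


class DefaultEvaluationPlan:
    """Hypothetical default plan: accepts any number of named metric callables
    and reports the per-epoch average of each one."""

    def __init__(self, **metrics):
        self.metrics = metrics           # name -> callable(y_hat, y)
        self.values = defaultdict(list)  # name -> list of per-batch values

    def on_batch_end(self, y_hat, y):
        for name, fn in self.metrics.items():
            self.values[name].append(torch.as_tensor(fn(y_hat, y), dtype=torch.float))

    def on_epoch_end(self):
        # average every metric over the epoch, then reset the accumulated state
        results = {name: torch.stack(vals).mean() for name, vals in self.values.items()}
        self.values.clear()
        return results
```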
What should BatchEpochPlan do?
I absolutely agree, every
I agree that there should be a metrics package, but I would differentiate between the metric itself and its integration into the trainer/module. I like the approach of an
1.) We can postpone the EvaluationPlan implementation if necessary
First of all, congratulations on this amazing project. I love Lightning.
@WSzP @Darktex thanks for the awesome feedback! @justusschock will be leading the implementation here. I suspect we can learn a lot from what others have done before and take a stab at this with those lessons learned. I do agree with @srush that some of the metrics are very domain-specific, but we can take that into consideration. For some metrics like BLEU, etc., we should make sure to point to the reference paper for implementation correctness. The few times I had to calculate BLEU made me depressed for weeks haha, because of all the hoops I had to jump through.
there's a related issue #1256 where I proposed that we add a key for
@justusschock Could we do similar wrapping as we did for Loggers, where we have
I like the idea, but I'm not sure we need this here...
This would then work with any number of metrics.
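For reference, a minimal sketch of that kind of wrapper, assuming a hypothetical collection class (analogous to how multiple loggers are wrapped) whose metrics expose the update/compute API discussed above:

```python
import torch
from torch import nn


class MetricCollection(nn.Module):
    """Hypothetical wrapper: holds several metrics and fans update/compute out to all of them."""

    def __init__(self, **metrics):
        super().__init__()
        # nn.ModuleDict so metric state moves with .to(device)/DDP like any other submodule
        self.metrics = nn.ModuleDict(metrics)

    def update(self, y_hat: torch.Tensor, y: torch.Tensor) -> None:
        for metric in self.metrics.values():
            metric.update(y_hat, y)

    def compute(self) -> dict:
        return {name: metric.compute() for name, metric in self.metrics.items()}
```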
I would love to have a metrics module. But I think it would be best to have a Metrics abstract class first, which is basically an aggregator of training outputs and validation outputs. Then I could use sklearn or Ignite as the computation tool. Here is what I am thinking:

```python
class PLMetric:
    def __init__(self, calc_fn, mode='batch'):  # mode is either 'batch' or 'epoch'
        self.calc_fn = calc_fn
        self.mode = mode
        # This controls what the inputs x and y are after every batch.
        # If 'batch', x and y are the outputs of the current step only.
        # If 'epoch', x and y are all the previous outputs in the current epoch.
        self.prev_outputs = []

    def calc(self, x, y):
        if self.mode == 'epoch':
            self.prev_outputs.append([x, y])
            return self.calc_fn(self.prev_outputs)
        return self.calc_fn(x, y)
```

Then I could build my metrics like:

```python
def my_calc_fn(x, y):
    x, y = SOME_PREPROCESSING(x, y)
    return SOME_CALC(x, y)

met = PLMetric(my_calc_fn, 'epoch')
```

In a way, I think it makes more sense to have a metrics param in fit:

```python
mymodel.fit(model, data_loaders, metrics=[met1, met2])
```
@shijianjian we are currently working on this. But we will add the aggregator together with all the metrics. We have already implemented the base class for these on the metrics branch.
Hi, I am a data scientist and came across PyTorch Lightning a couple of days ago and really LIKE it so much! I am curious about the progress of the metrics in Lightning. As far as I can see from the 0.8.5 release, there are common metrics there, but it seems they were not designed to update in an online fashion and do not solve the DDP issue where the metrics have to be calculated based on y and y_pred from all batches and all GPUs. Did I miss something? Is there a plan to have them soon? I am using Lightning in my project, but because of the DDP issue I mentioned above, the metrics are calculated over the batches from each GPU instead of all GPUs. I was hoping this could be fixed soon so I can switch completely to Lightning! Thank you so much!
Hi @junwen-austin, actually we can only compute metrics on each GPU and sync afterwards, but after #2528 we want to introduce a per-metric sync, so that we sync the already-computed parts but make them behave as if they had been computed over all samples.
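For illustration, a minimal sketch (not Lightning's implementation) of the idea behind such a sync, assuming a mean-squared-error metric whose state is a running sum of squared errors plus a sample count: every process reduces the state across GPUs before the final division, so the result matches a single-process computation over all samples.

```python
import torch
import torch.distributed as dist


class DistributedMSE:
    """Sketch of a DDP-aware metric: accumulate state per process, sync the state, then compute."""

    def __init__(self):
        self.sum_squared_error = torch.tensor(0.0)
        self.n_samples = torch.tensor(0.0)

    def update(self, y_hat: torch.Tensor, y: torch.Tensor) -> None:
        # keep the state on the same device as the inputs so all_reduce works under NCCL
        self.sum_squared_error = self.sum_squared_error.to(y_hat.device) + ((y_hat - y) ** 2).sum()
        self.n_samples = self.n_samples.to(y_hat.device) + y.numel()

    def compute(self) -> torch.Tensor:
        sse, n = self.sum_squared_error.clone(), self.n_samples.clone()
        if dist.is_available() and dist.is_initialized():
            # sum the partial state from every process before dividing
            dist.all_reduce(sse, op=dist.ReduceOp.SUM)
            dist.all_reduce(n, op=dist.ReduceOp.SUM)
        return sse / n
```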
@justusschock @williamFalcon Thanks for the reply! I can't say enough what an incredible piece of work you guys are doing!!!! Do you have an estimate of when #2528 will be completed? Also, at the moment, what is the best way of calculating a metric correctly in a multi-GPU DDP scenario, as in the example in #2528: the first machine returns (assume the sum of squared error is 200)
Thanks again.
mind sending a PR adding a test to replicate this case? It would help us fix it...
@Borda the example above was actually taken from #2528, suggested by @justusschock :)
🚀 Feature
Introduce a metrics package that contains easy-to-use reference implementations of metrics people care about. The package should take care of the following:
- The `ConfusionMatrix` metric can use Pandas for terminal output (where it will print a nicely tabulated representation) and maybe a color-coded map for notebooks and TensorBoard (see the sketch after this list).
- A `MetricsReporter` of some sort that will generate a full report of all the metrics you care about.

This would be a rather large change, and I'm not sure what is the best way to do it. This issue is really meant to spur discussion on the topic. The actual solution might require some stuff to be landed on PyTorch, and that's fine.
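As a sketch of the display idea only (the helper names here are made up for illustration), a confusion matrix accumulated as a tensor can be rendered as a labeled pandas DataFrame for the terminal, and the same table could be color-coded for notebooks or TensorBoard:

```python
import pandas as pd
import torch


def confusion_matrix(y_hat: torch.Tensor, y: torch.Tensor, num_classes: int) -> torch.Tensor:
    # rows = true class, columns = predicted class
    preds = y_hat.argmax(dim=1)
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(y, preds):
        cm[t, p] += 1
    return cm


def as_dataframe(cm: torch.Tensor, class_names) -> pd.DataFrame:
    # nicely tabulated representation for the terminal
    return pd.DataFrame(cm.numpy(), index=class_names, columns=class_names)


y = torch.tensor([0, 1, 2, 2])
y_hat = torch.nn.functional.one_hot(torch.tensor([0, 2, 2, 2]), num_classes=3).float()
print(as_dataframe(confusion_matrix(y_hat, y, num_classes=3), ["cat", "dog", "bird"]))
```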
Motivation
Metrics are a big component of reproducibility. They satisfy all the requirements you can think about to justify standardizing them:
- For example, when computing a standard deviation, do you normalize by `n` or `n-1`? (aka: do you use Bessel's correction?) NumPy defaults to not using it; MATLAB and PyTorch default to using it. It's unsurprising to see threads like this; a quick demonstration follows below.
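The discrepancy is easy to reproduce with nothing but the libraries' defaults:

```python
import numpy as np
import torch

data = [1.0, 2.0, 3.0, 4.0]

print(np.std(np.array(data)))                # ~1.118: NumPy divides by n (ddof=0 by default)
print(torch.std(torch.tensor(data)).item())  # ~1.291: PyTorch divides by n-1 (Bessel's correction)
print(np.std(np.array(data), ddof=1))        # ~1.291: matches PyTorch once ddof=1 is passed
```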
Pitch
I think Lightning should take a page from Ignite's book and add a metrics package.
I also propose that Lightning take care of the following:
- Handle the `validation_step()` and `validation_end()` steps for each metric, so that they can be computed efficiently per-batch.

In this proposal, the API for the LightningModule will be simplified significantly. For example, something like this:
Alternative implementation
If we find a way to have all metrics computation code done in PyTorch, even for DDP, it would be highly preferable I think. I just don't know if it's possible - maybe if we formulate metrics as a layer of sorts we might be able to do that? All standard layers have state that persists and gets updated across batches (their weights :D), so maybe we can implement metrics as a sort of `nn.Module`?
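A minimal sketch of that idea, assuming a metric whose running state lives in registered buffers so it persists across batches, moves with `.to(device)`, and can be synced across processes like any other module state:

```python
import torch
from torch import nn


class Accuracy(nn.Module):
    """Metric as a sort-of nn.Module: buffers hold running state instead of weights."""

    def __init__(self):
        super().__init__()
        self.register_buffer("correct", torch.tensor(0.0))
        self.register_buffer("total", torch.tensor(0.0))

    def update(self, y_hat: torch.Tensor, y: torch.Tensor) -> None:
        # accumulate state across batches; no gradients involved
        with torch.no_grad():
            self.correct += (y_hat.argmax(dim=1) == y).float().sum()
            self.total += y.numel()

    def forward(self) -> torch.Tensor:
        # compute the final value from the accumulated state
        return self.correct / self.total

    def reset(self) -> None:
        self.correct.zero_()
        self.total.zero_()
```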
Additional context
There is a separate discussion about providing the underlying muscle for this directly in torch (see pytorch/pytorch#22439).