Adding NVIDIA-SMI like information #2074
Comments
Hi! Thanks for your contribution, great first issue!
Adding to 1): you can use the batch size finder.
If your goal is just to optimize for batch size, then the batch finder may be what you are looking for.
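For reference, a minimal sketch of that batch size finder, assuming the 1.x-era PyTorch Lightning API (auto_scale_batch_size plus trainer.tune; newer releases moved this into a separate Tuner class). The BoringModel below is a made-up placeholder, not code from this issue:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    """Tiny module whose `batch_size` attribute the finder can scale."""

    def __init__(self, batch_size=2):
        super().__init__()
        self.batch_size = batch_size
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(512, 32))
        return DataLoader(data, batch_size=self.batch_size)


# "power" doubles the batch size until an OOM is hit, then keeps the largest
# size that still fit.
trainer = pl.Trainer(auto_scale_batch_size="power", max_epochs=1)
model = BoringModel()
trainer.tune(model)       # runs the batch size finder before training
print(model.batch_size)   # the scaled batch size
```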
Hi @SkafteNicki @Borda,
@groadabike mind sending a PR with a PL callback?
Hi @Borda, I tried to use the gpumonitor callback, but it didn't work on my HPC.
@groadabike I think that looks like a great addition. If you want to submit a PR, feel free :)
Closing this as it was solved by PR #2932.
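For later readers, a minimal usage sketch of the GPU stats callback that landed in Lightning around that time (names as in the 1.x API, where it was called GPUStatsMonitor; later versions replaced it with DeviceStatsMonitor). It polls nvidia-smi, so the binary must be available on the node:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import GPUStatsMonitor

# GPUStatsMonitor queries nvidia-smi and sends the readings to the attached logger.
gpu_stats = GPUStatsMonitor(
    memory_utilization=True,  # memory.used / memory.free
    gpu_utilization=True,     # utilization.gpu
    temperature=True,
)

trainer = pl.Trainer(gpus=1, callbacks=[gpu_stats])
# trainer.fit(model)  # `model` is any LightningModule
```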
🚀 Feature
Motivation
Most research is done on HPC systems. Therefore, if I want to see the GPU RAM and utilization of my job, I have to open a second terminal to run "watch nvidia-smi" or "nvidia-smi dmon".
Having this information saved in the logs would help to:
Pitch
When training starts, report the GPU RAM and GPU utilization together with the loss and v_num.
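One way to approximate this pitch with a custom callback (a sketch only, not the implementation that was eventually merged): read the torch.cuda memory counters and log them with prog_bar=True so they show up next to the loss and v_num. Note this only covers memory allocated by PyTorch; utilization figures would still need nvidia-smi or NVML. The GPUMemoryLogger name is made up for illustration:

```python
import torch
import pytorch_lightning as pl


class GPUMemoryLogger(pl.Callback):
    """Log per-device GPU memory next to the loss in the progress bar."""

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if not torch.cuda.is_available():
            return
        device = pl_module.device
        allocated_mb = torch.cuda.memory_allocated(device) / 1024 ** 2
        reserved_mb = torch.cuda.memory_reserved(device) / 1024 ** 2
        # prog_bar=True puts the values alongside loss and v_num.
        pl_module.log("gpu_mem_alloc_mb", allocated_mb, prog_bar=True)
        pl_module.log("gpu_mem_reserved_mb", reserved_mb, prog_bar=True)


# trainer = pl.Trainer(gpus=1, callbacks=[GPUMemoryLogger()])
```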
Alternatives
After the first epoch is loaded onto the GPU, log the GPU RAM and GPU utilization.
Additional context