Slow Training with GPU #1

Open
jecummin opened this issue Mar 6, 2025 · 1 comment

jecummin commented Mar 6, 2025

I ran the usage example in the README after setting up the repo and environment, but the training speed seems to be much slower than the README suggests.

```
git clone https://github.com/iliao2345/CompressARC.git
cd CompressARC
mamba create -n compress
mamba activate compress
pip install -r requirements.txt
python analyze_example.py
Enter which split you want to find the task in (training, evaluation, test): training
Enter which task you want to analyze (eg. 272f95fa): 272f95fa
...
```

The progress bar estimates that the task will take ~50 min to train, which is roughly 3-4x slower than the README suggests. I've tried this with both an NVIDIA T4 and an NVIDIA A100 GPU, and I confirmed that the GPU was in use during training. By comparison, if I restrict training to the CPU only, the estimated training time is ~1 hr, so the GPU doesn't seem to be accelerating things much.

I have made no edits to the code.
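
In case it's useful, nothing CompressARC-specific was needed to verify GPU visibility; just the standard PyTorch checks alongside `nvidia-smi`:

```python
import torch

# Standard checks that PyTorch can see the CUDA device (run alongside `nvidia-smi`,
# which should show nonzero GPU utilization while training is running).
print(torch.cuda.is_available())       # True if a CUDA device is visible
print(torch.cuda.get_device_name(0))   # e.g. "Tesla T4" or "NVIDIA A100-SXM4-40GB"
print(torch.version.cuda)              # CUDA version PyTorch was built against
```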

@iliao2345 (Owner)

Hi there,

Thanks for testing out the code and providing your feedback. All our performance benchmarks were conducted using an NVIDIA RTX 4070, so the training speeds mentioned in the README are based solely on that hardware. I realize that GPUs like the T4 and A100 are typically considered more powerful, but their architectures and optimal usage can differ significantly.

For instance, the A100 can be much more efficient when using FP16 precision, Tensor Cores, and optimizations like torch.compile(), none of which are currently enabled in the code. Additionally, if the workload is compute-bound rather than memory-bound, it might actually perform better on the 4070 in our specific case, since the 4070's plain FP32 throughput is comparable to or higher than the A100's.

It might be worth experimenting with FP16 to see if that improves performance on your setup.
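
As a rough sketch (the model, optimizer, and loss below are placeholders, not CompressARC's actual training objects), enabling mixed precision and torch.compile() in PyTorch usually looks something like this:

```python
import torch

# Placeholder model/optimizer/data -- not CompressARC's actual training setup.
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()      # scales the loss to keep FP16 gradients stable

compiled_model = torch.compile(model)     # optional: kernel fusion (PyTorch 2.x)

for step in range(100):
    x = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    # Run matmuls in FP16 on Tensor Cores where possible.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = compiled_model(x).square().mean()
    scaler.scale(loss).backward()         # backprop on the scaled loss
    scaler.step(optimizer)                # unscales gradients, then steps
    scaler.update()
```

Whether this actually helps will depend on how much of the step time is spent in matmuls versus small elementwise ops and Python overhead, so treat it as an experiment rather than a guaranteed speedup.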

Thanks again for reporting this, and please let me know if you discover any other issues!
