Use k8s volcano replicas to shrink job manifest size #1054

Open
clumsy opened this issue Apr 26, 2025 · 2 comments
Comments

@clumsy
Contributor

clumsy commented Apr 26, 2025

Description

Support for using replicas for repetitive pod configuration in kubernetes_scheduler was removed in f6907e8.

The rationale is here

Unfortunately, a large setup can easily exceed the default etcd request size limit of 1.5 MiB, failing with "etcdserver: request is too large".
It's not always possible to raise max-request-bytes, e.g. on AWS EKS.

Currently, both job-specific environment variables and TorchX's own environment variables contribute to breaching this limit.

We would like to find a way to make replicas work to minimize job manifest size.
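
To illustrate the scale of the problem, here is a rough sketch comparing the serialized size of a Volcano Job with one task per pod against a single task with replicas. This is not TorchX's actual manifest builder: the helper names, image, and environment variables are made up, and JSON length is only a proxy for what etcd actually receives.

```python
import json

# Default etcd request limit (--max-request-bytes): 1.5 MiB.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1572864


def make_task(name: str, replicas: int, env: dict) -> dict:
    """Build one Volcano task entry (hypothetical, trimmed-down shape)."""
    return {
        "name": name,
        "replicas": replicas,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "main",
                        "image": "example.com/trainer:latest",  # placeholder image
                        "env": [{"name": k, "value": v} for k, v in env.items()],
                    }
                ]
            }
        },
    }


def manifest_size(tasks: list) -> int:
    """Serialized size in bytes of a Volcano Job holding the given tasks."""
    job = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": "demo"},
        "spec": {"tasks": tasks},
    }
    return len(json.dumps(job).encode("utf-8"))


# 50 made-up environment variables shared by every replica of a role.
shared_env = {f"SETTING_{i}": "x" * 40 for i in range(50)}
num_replicas = 512

# One task per pod: the shared config is repeated num_replicas times.
per_pod = manifest_size(
    [make_task(f"trainer-{i}", 1, shared_env) for i in range(num_replicas)]
)
# One task with replicas: the shared config appears once.
with_replicas = manifest_size([make_task("trainer", num_replicas, shared_env)])

print(f"one task per pod: {per_pod} bytes")
print(f"single task with replicas: {with_replicas} bytes")
print(f"etcd default limit: {ETCD_DEFAULT_MAX_REQUEST_BYTES} bytes")
```

With these made-up numbers the per-pod manifest grows linearly with the replica count and crosses the default limit, while the replicas-based manifest stays roughly constant in size.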

Motivation/Background

Increase the maximum cluster size we can support with k8s

Detailed Proposal

For example, use a ConfigMap with per-node/role config, or the Downward API. This takes advantage of the fact that we have roles with many replicas that share a large chunk of their configuration.
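
A minimal sketch of what this could look like, assuming the shared environment moves into one ConfigMap per role and each pod picks up its own identity through the Downward API. The function names and the POD_NAME variable are hypothetical and not part of TorchX; only the Kubernetes fields (envFrom, configMapRef, fieldRef) are standard.

```python
def shared_config_map(name: str, env: dict) -> dict:
    """ConfigMap holding the environment shared by all replicas of a role."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(env),
    }


def container_for_role(image: str, config_map_name: str) -> dict:
    """Container spec that is identical for every replica of the role."""
    return {
        "name": "main",
        "image": image,
        # Import every key of the ConfigMap as an environment variable,
        # so the values themselves never appear in the job manifest.
        "envFrom": [{"configMapRef": {"name": config_map_name}}],
        "env": [
            # Downward API: each pod learns its own name at runtime,
            # so per-pod identity does not need a per-pod template.
            {
                "name": "POD_NAME",  # hypothetical variable name
                "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
            }
        ],
    }


# Example: one ConfigMap and one container spec for a whole role.
config_map = shared_config_map("trainer-env", {"SETTING_A": "1", "SETTING_B": "2"})
container = container_for_role("example.com/trainer:latest", "trainer-env")
```

The ConfigMap is submitted as a separate object, so its contents no longer count toward the size of the job manifest. Note that a ConfigMap has its own 1 MiB data limit, so this moves the shared configuration rather than lifting all limits.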

Alternatives

Avoid environment variables and long names anywhere in the configuration. Even so, the maximum job size we could support would on average be significantly smaller than with replicas.

Additional context/links

@clumsy
Contributor Author

clumsy commented Apr 26, 2025

For your consideration, @kiukchung, @tonykao8080, @andywag, @d4l3k

Willing to contribute as always if there are no objections.

Thanks!

@andywag
Contributor

andywag commented Apr 26, 2025

I don't have objections. @kiukchung, do you have any thoughts?
