Use k8s volcano replicas to shrink job manifest size #1054

clumsy · 2025-04-26T00:57:16Z

Description

Support for using replicas for repetitive pod configuration in kubernetes_scheduler was removed in f6907e8.

The rationale is here

Unfortunately, for a large setup we can easily breach the default etcd request size limit of 1.5 MiB: etcdserver: request is too large
It's not always possible to bump max-request-bytes, e.g. on AWS EKS.

Currently both job-specific environment variables and even TorchX's own environment variables contribute to breaching this limit.
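
To get a rough sense of how quickly this happens, the sketch below serializes a Volcano-style job with one task entry per replica and compares its size to the 1.5 MiB default. The variable names, image, role name and counts are made up for illustration, and JSON size is only a proxy for what the API server actually sends to etcd.

```python
import json

# etcd's default --max-request-bytes (1.5 MiB).
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def make_env(num_vars: int, value_len: int) -> list[dict]:
    """Build a hypothetical env block that every pod of a role would carry."""
    return [
        {"name": f"EXAMPLE_VAR_{i}", "value": "x" * value_len}
        for i in range(num_vars)
    ]


def per_replica_manifest(num_replicas: int, env: list[dict]) -> dict:
    """One task per replica, each with a full copy of the pod template."""
    tasks = [
        {
            "name": f"trainer-{i}",  # hypothetical role name
            "replicas": 1,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "trainer",
                            "image": "example/image:latest",  # placeholder
                            "env": env,
                        }
                    ]
                }
            },
        }
        for i in range(num_replicas)
    ]
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "spec": {"tasks": tasks},
    }


def manifest_bytes(manifest: dict) -> int:
    """Size of the manifest when serialized as UTF-8 JSON."""
    return len(json.dumps(manifest).encode("utf-8"))


env = make_env(num_vars=50, value_len=40)
for n in (64, 256, 1024):
    size = manifest_bytes(per_replica_manifest(n, env))
    print(n, size, size > ETCD_DEFAULT_MAX_REQUEST_BYTES)
```

The size grows linearly with the number of replicas because the env block is repeated in every task.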

We would like to find a way to make replicas work to minimize job manifest size.

Motivation/Background

Increase the maximum cluster size we can support with k8s

Detailed Proposal

For example, use a ConfigMap with per-node/per-role config, or the Downward API. This takes advantage of the fact that our roles have many replicas that share a large part of their configuration.
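
One possible shape for this, as a sketch under assumptions rather than a proposed implementation: the shared environment moves into a ConfigMap referenced via envFrom, and a single task with replicas set to the role size carries only a Downward API reference for per-pod identity. The helper names, the POD_NAME variable, the image and the role name are all hypothetical.

```python
import json


def shared_config_map(name: str, env: dict[str, str]) -> dict:
    """ConfigMap holding the environment shared by all replicas of a role."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": env,
    }


def replicated_task(role: str, num_replicas: int, config_map_name: str) -> dict:
    """Single task entry whose pod template is written once for all replicas."""
    container = {
        "name": role,
        "image": "example/image:latest",  # placeholder image
        # Shared variables are pulled in from the ConfigMap, not inlined.
        "envFrom": [{"configMapRef": {"name": config_map_name}}],
        # Per-pod identity comes from the Downward API.
        "env": [
            {
                "name": "POD_NAME",  # hypothetical variable name
                "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
            }
        ],
    }
    return {
        "name": role,
        "replicas": num_replicas,
        "template": {"spec": {"containers": [container]}},
    }


def task_bytes(task: dict) -> int:
    """Size of the task entry when serialized as UTF-8 JSON."""
    return len(json.dumps(task).encode("utf-8"))


config_map = shared_config_map(
    "trainer-env", {f"EXAMPLE_VAR_{i}": "x" * 40 for i in range(50)}
)
small = replicated_task("trainer", 8, config_map["metadata"]["name"])
large = replicated_task("trainer", 1024, config_map["metadata"]["name"])
print(task_bytes(small), task_bytes(large))
```

With this layout the task entry stays essentially the same size regardless of replica count. The ConfigMap is a separate object with its own 1 MiB data limit, and any per-replica value (such as a replica index) would have to be derived at runtime, which this sketch does not address.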

Alternatives

Avoid environment variables and long names everywhere in the configuration. Even then, the maximum supported cluster size will on average be significantly smaller than with replicas.

Additional context/links

clumsy · 2025-04-26T00:58:43Z

For your consideration, @kiukchung, @tonykao8080, @andywag, @d4l3k

Willing to contribute as always if there are no objections.

Thanks!

andywag · 2025-04-26T15:09:53Z

I don't have objections. @kiukchung, do you have any thoughts?
