Description
Using replicas for repetitive pod configuration in kubernetes_scheduler has been removed in f6907e8. The rationale is linked here.
Unfortunately, for a large setup we can easily breach the default 1.5 MB limit:
etcdserver: request is too large
It's not always possible to bump max-request-bytes, e.g. on AWS EKS. Currently both job-specific and even TorchX's own environment variables contribute to breaching this limit.
We would like to find a way to bring replicas back in order to minimize the job manifest size.
Motivation/Background
Increase the maximum cluster size we can support with k8s
Detailed Proposal
For example, use a ConfigMap with per-node/per-role configuration, or the Downward API. Take advantage of the fact that we have roles with many replicas that share a huge chunk of their configuration.
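One possible shape of this, as a minimal sketch using the kubernetes Python client (the ConfigMap name, config keys, label, and the idea of recovering per-replica identity from the pod name are assumptions for illustration, not the current TorchX implementation):

from kubernetes import client

# Shared per-role configuration is stored once in a ConfigMap instead of being
# repeated in every pod spec (illustrative keys, not the actual TorchX variables).
shared_config = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="trainer-role-config"),
    data={
        "TORCHX_JOB_ID": "my-job",
        "LONG_APP_SPECIFIC_SETTING": "value shared by every replica",
    },
)

# A single pod template for the whole role: shared config is pulled in via
# envFrom, and per-replica identity comes from the Downward API rather than
# a unique environment block baked into each replica's spec.
container = client.V1Container(
    name="trainer",
    image="my-trainer:latest",
    env_from=[
        client.V1EnvFromSource(
            config_map_ref=client.V1ConfigMapEnvSource(name="trainer-role-config")
        )
    ],
    env=[
        client.V1EnvVar(
            name="POD_NAME",
            value_from=client.V1EnvVarSource(
                field_ref=client.V1ObjectFieldSelector(field_path="metadata.name")
            ),
        )
    ],
)

pod_template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"torchx/role": "trainer"}),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

With this layout the manifest grows with the number of roles rather than the number of replicas, which is what keeps it under the etcd request-size limit.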
Alternatives
Avoid environment variables and long names anywhere in the configuration. Even then, the maximum job size we could support would, on average, still be significantly smaller than what using replicas allows.
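A rough back-of-the-envelope comparison (all per-spec sizes below are assumed, illustrative numbers, not measurements):

# How many replicas fit under the default etcd limit when every replica's pod
# spec is expanded into the job manifest (sizes are assumptions for illustration).
ETCD_MAX_REQUEST_BYTES = 1_572_864     # ~1.5 MB default limit
BYTES_PER_EXPANDED_REPLICA = 2_048     # assume ~2 KB per replica even with trimmed names/env vars

# Without replicas the manifest grows linearly with the replica count...
max_replicas_expanded = ETCD_MAX_REQUEST_BYTES // BYTES_PER_EXPANDED_REPLICA
print(max_replicas_expanded)           # on the order of a few hundred replicas

# ...whereas with replicas the manifest size stays roughly constant, since the
# shared role spec is stored once and only the replica count changes.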
Additional context/links