Feat: Add aggregation from kube service endpoints feature in metrics API scaler #6565
base: main
Conversation
Force-pushed from 742e19d to be170fd
/run-e2e metrics_api

@zroubalik thanks for triggering E2E, which ran fine here
…aler Signed-off-by: julian GUINARD <[email protected]>

Force-pushed from be170fd to 172e504
…s API (was not visible in E2E tests as the server run by ghcr.io/kedacore/tests-metrics-api seems to accommodate it anyway) - metrics_api_test.go: fix getUpdateUrlsForAllMetricAllMetricsServerReplicas() to retry more transient failure cases & fail the test after 5 failed retries Signed-off-by: julian GUINARD <[email protected]>
Force-pushed from 172e504 to 61ebd1f
@zroubalik FYI I added this commit to make E2E more resilient on clusters where launching all metrics server replicas takes longer than expected: we now sleep 1 second & try 4 more times to validate the expected pods before failing. It also fixes the formatting of each endpoint URL reached when aggregating metrics (a double "/" in the path, which did not trigger errors in E2E because the server run by ghcr.io/kedacore/tests-metrics-api seems to cope with it anyway)
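For illustration, the retry behavior could look roughly like the sketch below; `validateWithRetries` and the `validateExpectedPods` callback are hypothetical stand-ins for the real check in `metrics_api_test.go`, not the PR's actual code:

```go
package e2e

import (
	"testing"
	"time"
)

// validateWithRetries sketches the retry loop described above: validate the
// expected pods, sleep 1 second between attempts, and fail the test after
// 5 unsuccessful tries. The callback is an assumption for illustration.
func validateWithRetries(t *testing.T, validateExpectedPods func() error) {
	const maxRetries = 5
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if lastErr = validateExpectedPods(); lastErr == nil {
			return
		}
		// Give slower clusters time to start all metrics server replicas.
		time.Sleep(1 * time.Second)
	}
	t.Fatalf("expected pods not validated after %d retries: %v", maxRetries, lastErr)
}
```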
/run-e2e metrics_api

@JorTurFer thanks for the E2E trigger, which all ran fine. Waiting for reviews & opinions on this now :)
```go
// Fallback chain: if the endpoints cannot be resolved, if none exist, or if
// aggregation fails, fall back to querying the single URL from metadata.
if err != nil {
	s.logger.Error(err, "Failed to get kubernetes endpoints urls from configured service URL. Falling back to querying url configured in metadata")
} else {
	if len(endpointsUrls) == 0 {
		// note: err is nil in this branch
		s.logger.Error(err, "No endpoints URLs were given for the service name. Falling back to querying url configured in metadata")
	} else {
		aggregatedMetric, err := s.aggregateMetricsFromMultipleEndpoints(ctx, endpointsUrls)
		if err != nil {
			s.logger.Error(err, "No aggregated metrics could be computed from service endpoints. Falling back to querying url configured in metadata")
		} else {
			return aggregatedMetric, err
		}
	}
}
```
I'm not sure we should fall back here: if a user wants to use the feature and it fails, the fallback can produce unexpected scaling behaviors
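For comparison, a fail-fast variant along the lines of this review comment would surface the error instead of silently falling back. This is a sketch only, assuming the surrounding helper returns `(float64, error)`; the return type and error wrapping are not taken from the PR:

```go
// Fail-fast sketch: when aggregation from service endpoints is explicitly
// enabled, propagate errors instead of falling back to the metadata URL,
// so a misconfiguration becomes visible rather than silently changing
// scaling behavior.
if err != nil {
	return 0, fmt.Errorf("failed to get kubernetes endpoints URLs from configured service URL: %w", err)
}
if len(endpointsUrls) == 0 {
	return 0, fmt.Errorf("no endpoints URLs found for the configured service")
}
return s.aggregateMetricsFromMultipleEndpoints(ctx, endpointsUrls)
```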
TL;DR

Add the ability for the metrics-api scaler to get metrics from all endpoint targets of a kubernetes service (`aggregateFromKubeServiceEndpoints: true` metadata) and aggregate them using the `average`, `sum`, `min` or `max` aggregation functions (`aggregationType` metadata). This is handy in an environment where no metric aggregator/scraping stack (e.g. Prometheus) has been set up, or where one simply doesn't want to use their monitoring stack to fetch and serve metrics from customer workloads in their own kubernetes clusters, leaving responsibility for the metrics API up to the customer.

This PR comes from an issue met when the metrics-api scaler targets an internal kubernetes service, such as the one from the metrics_api E2E tests but which, unlike it, has more than 1 replica serving it, leading to inconsistent HPA average calculation.
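As a rough sketch of what the four `aggregationType` modes compute, the per-endpoint values could be combined as below; the function and variable names are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// aggregate is a hypothetical illustration of the four aggregation modes
// described above; it combines the metric values fetched from each endpoint.
func aggregate(values []float64, aggregationType string) (float64, error) {
	if len(values) == 0 {
		return 0, fmt.Errorf("no metric values to aggregate")
	}
	switch aggregationType {
	case "average", "sum", "min", "max":
	default:
		return 0, fmt.Errorf("unknown aggregationType %q", aggregationType)
	}
	result := values[0]
	for _, v := range values[1:] {
		switch aggregationType {
		case "sum", "average":
			result += v
		case "min":
			result = math.Min(result, v)
		case "max":
			result = math.Max(result, v)
		}
	}
	if aggregationType == "average" {
		result /= float64(len(values))
	}
	return result, nil
}
```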
The more pods a kubernetes service used by the metrics API scaler has as targets, the less likely it is that the scaler gets metrics from all of those pods, leading to inconsistent HPA average metric computation and eventually to scaling issues, especially in configurations where kubernetes services have "sticky" session settings.
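To make the mechanism concrete, resolving every pod behind the service could look roughly like the following client-go sketch; the function name, URL scheme and port handling are assumptions for illustration, not the PR's exact implementation:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointURLs is a hypothetical sketch: it reads the Endpoints object of a
// Service and builds one URL per backing pod address, so each replica can be
// queried directly instead of going through the load-balanced Service VIP.
func endpointURLs(ctx context.Context, kc kubernetes.Interface, namespace, service, path string) ([]string, error) {
	endpoints, err := kc.CoreV1().Endpoints(namespace).Get(ctx, service, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var urls []string
	for _, subset := range endpoints.Subsets {
		for _, addr := range subset.Addresses {
			for _, port := range subset.Ports {
				// Trim any leading "/" so the joined URL avoids the
				// double "/" in the path mentioned in the PR commits.
				urls = append(urls, fmt.Sprintf("http://%s:%d/%s", addr.IP, port.Port, strings.TrimPrefix(path, "/")))
			}
		}
	}
	return urls, nil
}
```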
Below are screenshots of the modified E2E `metrics_api_test.go`, with the `aggregateFromKubeServiceEndpoints: true` metadata addition commented out, being executed & failing: scale in never occurs because the metrics from all 10 replicas behind the metrics server deployment are not all taken into account. Logs from keda-operator show the same value (`16`) being returned to the metrics-api scaler by the kubernetes service:

[screenshot: keda-operator logs showing the same metric value `16` returned for every query]

As a consequence, the test eventually fails because the scaled deployment is unable to scale out within the 3-minute timeframe:

[screenshot: failing E2E test run timing out]
Uncommenting the `aggregateFromKubeServiceEndpoints: true` metadata addition enables metric aggregation mode, and the same test passes this time.

Checklist
Helm PR: kedacore/charts#739
Docs PR: kedacore/keda-docs#1541