Feat : Add aggregation from kube service endpoints feature in metrics API scaler #6565

Open
wants to merge 2 commits into base: main
Conversation

julianguinard

@julianguinard julianguinard commented Feb 24, 2025

TL;DR

Add the ability for the metrics-api scaler to get metrics from all endpoint targets of a Kubernetes service (aggregateFromKubeServiceEndpoints: true metadata) and aggregate them using an average, sum, min, or max aggregation function (aggregationType metadata). This is handy in environments where no metric aggregation/scraping stack (e.g. Prometheus) has been set up, or where one simply doesn't want to use their own monitoring stack to fetch and serve metrics from customer workloads running in the customers' Kubernetes clusters, and prefers to leave responsibility for the metrics API up to the customer.
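
For illustration, here is a minimal standalone Go sketch of what the aggregation step described above could look like for the four aggregationType options; the function name, signature, and sample values are made up for this example and are not the actual PR code:

package main

import "fmt"

// aggregate is a hypothetical helper combining per-endpoint metric values
// according to an aggregation type (average, sum, min, max).
func aggregate(values []float64, aggregationType string) (float64, error) {
	if len(values) == 0 {
		return 0, fmt.Errorf("no endpoint values to aggregate")
	}
	sum, minV, maxV := 0.0, values[0], values[0]
	for _, v := range values {
		sum += v
		if v < minV {
			minV = v
		}
		if v > maxV {
			maxV = v
		}
	}
	switch aggregationType {
	case "average":
		return sum / float64(len(values)), nil
	case "sum":
		return sum, nil
	case "min":
		return minV, nil
	case "max":
		return maxV, nil
	default:
		return 0, fmt.Errorf("unknown aggregationType %q", aggregationType)
	}
}

func main() {
	// Example values only: three endpoints reporting different metric values.
	avg, _ := aggregate([]float64{16, 4, 10}, "average")
	fmt.Println(avg) // 10
}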

This PR comes from an issue encountered when the metrics-api scaler targets an internal Kubernetes service, such as the one used in the metrics_api E2E tests, that (unlike the test service) has more than one replica serving it, leading to inconsistent HPA average calculations.

The more pods a Kubernetes service used by the metrics API scaler has as targets, the less likely the scaler is to see the metrics of all those pods, leading to inconsistent HPA average metric computation and eventually to scaling issues, especially when the Kubernetes service has a "sticky" configuration.

Below are screenshots of the modified E2E metrics_api_test.go, with the aggregateFromKubeServiceEndpoints: true metadata addition commented out, being executed and failing: scale-in never occurs because the metrics from all 10 replicas behind the metrics server deployment are not all taken into account.

Screenshot from 2025-02-24 17-06-35

Logs from keda-operator show the same value (16) being returned to the metrics-api scaler by the Kubernetes service:
Screenshot from 2025-02-24 16-18-07

As a consequence, the test eventually fails because the scaled deployment is unable to scale out within the 3-minute timeframe:
Screenshot from 2025-02-24 16-19-00

Uncommenting the aggregateFromKubeServiceEndpoints: true metadata addition enables metric aggregation mode, and the same test passes this time:

Screenshot from 2025-02-24 16-22-14
Screenshot from 2025-02-24 16-22-07

Checklist

Helm PR : kedacore/charts#739
Docs PR : kedacore/keda-docs#1541

@julianguinard julianguinard force-pushed the add-metrics-api-aggregation-from-kube-service-feature branch 2 times, most recently from 742e19d to be170fd on February 25, 2025 09:57
@julianguinard julianguinard marked this pull request as ready for review February 25, 2025 10:31
@julianguinard julianguinard requested a review from a team as a code owner February 25, 2025 10:31
@julianguinard julianguinard changed the title from "Add aggregation from kube service endpoints feature in metrics API scaler" to "Feat : Add aggregation from kube service endpoints feature in metrics API scaler" on Feb 25, 2025
@zroubalik
Member

zroubalik commented Mar 5, 2025

/run-e2e metrics_api
Update: You can check the progress here

@julianguinard
Author

julianguinard commented Mar 6, 2025

/run-e2e metrics_api Update: You can check the progress here

@zroubalik thanks for triggering E2E, which ran fine here

@julianguinard julianguinard force-pushed the add-metrics-api-aggregation-from-kube-service-feature branch from be170fd to 172e504 on March 11, 2025 14:46
…s API (was not visible in E2E tests as the server run by ghcr.io/kedacore/tests-metrics-api seems to accommodate it anyway)

- metrics_api_test.go : fix getUpdateUrlsForAllMetricAllMetricsServerReplicas() to retry more transient failure cases & fail test after 5 failed retries

Signed-off-by: julian GUINARD <[email protected]>
@julianguinard julianguinard force-pushed the add-metrics-api-aggregation-from-kube-service-feature branch from 172e504 to 61ebd1f on March 11, 2025 15:02
@julianguinard
Author

@zroubalik FYI I added this commit to make the E2E test more resilient on clusters where launching all metrics server replicas takes longer than expected: we now sleep 1 second and retry up to 4 more times to validate the expected pods before failing.

It also fixes the formatting of each endpoint URL queried when aggregating metrics (a double "/" in the path, which did not trigger errors in E2E because the server run by ghcr.io/kedacore/tests-metrics-api seems to cope with it anyway).
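
For illustration only, here is one way such a double "/" can be avoided when joining an endpoint address with a configured path in Go, using net/url.JoinPath; the host, port, and path below are made-up example values, not taken from the PR:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Hypothetical endpoint address and metadata path, for illustration only.
	endpointHost := "http://10.0.0.12:8080"
	metadataPath := "/api/value"

	// url.JoinPath (Go 1.19+) collapses duplicate "/" between segments,
	// so a leading "/" on the path does not produce "...:8080//api/value".
	u, err := url.JoinPath(endpointHost, metadataPath)
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // http://10.0.0.12:8080/api/value
}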

@JorTurFer
Member

JorTurFer commented Mar 14, 2025

/run-e2e metrics_api
Update: You can check the progress here

@julianguinard
Author

/run-e2e metrics_api Update: You can check the progress here

@JorTurFer thanks for triggering the E2E tests, which all ran fine

waiting for reviews & opinions on this now :)

Comment on lines +515 to +528
if err != nil {
s.logger.Error(err, "Failed to get kubernetes endpoints urls from configured service URL. Falling back to querying url configured in metadata")
} else {
if len(endpointsUrls) == 0 {
s.logger.Error(err, "No endpoints URLs were given for the service name. Falling back to querying url configured in metadata")
} else {
aggregatedMetric, err := s.aggregateMetricsFromMultipleEndpoints(ctx, endpointsUrls)
if err != nil {
s.logger.Error(err, "No aggregated metrics could be computed from service endpoints. Falling back to querying url configured in metadata")
} else {
return aggregatedMetric, err
}
}
}
Member

I'm not sure we should fall back here; if a user wants to use the feature and it fails, the fallback can produce unexpected scaling behaviors.
