[kube-state-metrics] Use scrapeConfig to have HA #5470
Conversation
*The branch was force-pushed from 15a9f0b to 9a15708.*
What's the issue with duplicate metrics? You could drop the `instance` and `pod` labels to equalize them.
Hi @jkroepke, the issue with duplicate metrics is cost (storage) and memory efficiency (cardinality). Why would I need two copies of every metric when it is possible to have only one?
I'm trying to think about the possible issues that would require running KSM with multiple replicas:

- **Resource constraints:** KSM is using more resources (especially memory) than the defined limits and is crash-looping due to OOM. I might be wrong, but multiple replicas won't improve reliability in that case, since both instances will be collecting the same data under the same resource limits.
- **Rollout of new versions:** If you're rolling out a new version and for some reason it has a startup problem, it's true that multiple replicas might help, because the second replica won't be rolled out until the first one is ready and healthy. But in my opinion, we can achieve the same behavior by properly configuring the deployment strategy, setting configs like `minReadySeconds` and others.
- **Reschedules:** Multiple replicas might also help with pod reschedules, but we can achieve stability using something like PDBs to avoid ending up with zero replicas (see the sketch below).

If we really want to increase the availability of KSM, maybe we can consider the sharding feature, so you can have multiple KSM instances, each one collecting metrics from specific APIs.
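For illustration, a minimal PodDisruptionBudget along those lines might look like this (the name, namespace, and labels are assumptions for the sketch, not the chart's actual template):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-state-metrics   # hypothetical name
  namespace: monitoring      # hypothetical namespace
spec:
  # With 2 replicas, voluntary disruptions (node drains, rollouts)
  # must always leave at least one KSM pod running.
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
```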
@nicolastakashi I still see value in @jenciso's request. It's a low-budget version of HA for clusters that are not big enough to need the sharding mode, but where you still don't want to lose metrics when nodes are rescheduled. The same principle is used for scrape requests against blackbox_exporter.
Maybe I could be wrong, but we could say the same about the Prometheus HA strategy, where each Prometheus server scrapes and stores the same content and shares the same memory limits.
IMO, the main idea of having replicas of KSM is to avoid losing metrics during node downtime (a common use case). We could also recommend combining this with topology spread constraints or a podAntiAffinity (by hostname) to achieve a better solution; a sketch follows.
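A hedged values sketch of that combination, assuming the chart exposes a standard `affinity` passthrough (as most charts in this repo do) and uses the usual `app.kubernetes.io/name` label:

```yaml
replicas: 2

# Spread the two replicas across distinct nodes so a single node
# failure cannot take both KSM pods down at once.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: kube-state-metrics
```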
Yeah, but the problem is basically the duplication of metrics. This PR deals with it by offering an alternative to the use of serviceMonitor, proposing the use of scrapeConfig so that KSM can be seen as a simple Kubernetes service.
The sharding feature is great; I use it in big clusters when I want to distribute the load across multiple KSM instances. But if you lose a node running a specific shard, you will lose metrics, and rescheduling a StatefulSet pod is normally a headache if the node is still in a Terminating/Unknown state (it has already happened to Prometheus shards). For small/medium clusters, this PR solves an HA problem in a simple way. KSM is an important service for alerts and SLO metrics, and we can get a minimum of availability if we treat it as a classic Kubernetes service.
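For reference, the sharding mentioned above is driven by chart values; a sketch, assuming the kube-state-metrics chart's `autosharding` toggle:

```yaml
# values.yaml sketch: autosharding runs KSM as a StatefulSet where
# each replica watches only its shard of the cluster's objects.
autosharding:
  enabled: true
replicas: 3
```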
Signed-off-by: Juan Enciso <[email protected]>
What this PR does / why we need it
This PR enables scrapeConfig to be used as an alternative to serviceMonitor for kube-state-metrics.
We want a way to run kube-state-metrics in high availability. The simple way to do it is to increase the number of `replicas` to 2; however, that duplicates the kube-state-metrics metrics. With the scrapeConfig resource instead of serviceMonitor, you can scrape kube-state-metrics as a traditional Kubernetes Service: each scrape is answered by one healthy replica behind the Service, giving high availability without duplicated series. A hypothetical sketch follows.
Which issue this PR fixes
Special notes for your reviewer
Checklist
- [ ] Title of the PR starts with chart name (e.g. `[prometheus-couchdb-exporter]`)