There are three methods to use Kubecost with AMP. The guide below provides a high-level overview of using AMP with Kubecost; the links above provide detailed instructions for each method.
{% hint style="info" %} Using AMP allows multi-cluster Kubecost with EKS-optimized licenses. {% endhint %}
Kubecost uses the AWS SigV4 proxy to communicate securely with AMP. It enables password-less authentication using service roles, reducing the risk of exposing credentials.
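In practice, the password-less authentication is typically wired up with IAM Roles for Service Accounts (IRSA). A minimal sketch of the relevant Helm values, assuming a standard chart layout; the account ID and role name are placeholders:

```yaml
# Sketch (assumes IRSA is enabled on the cluster): annotate the Kubecost
# service account with an IAM role that has AMP query/remote-write permissions.
# The account ID and role name are placeholders -- substitute your own.
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<KUBECOST_AMP_ROLE>
```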
To support large-scale infrastructure (over 100 clusters), Kubecost leverages a Federated ETL architecture. In addition to the Amazon Managed Service for Prometheus workspace, Kubecost stores its data in a streamlined format (ETL) and ships it to a central S3 bucket. Kubecost's ETL data is a computed cache built from Prometheus metrics, against which users can run all possible Kubecost queries. Storing the ETL data in an S3 bucket adds resiliency to your cost allocation data, improves performance, and enables a highly available architecture for your Kubecost setup.
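For reference, the per-cluster object-store configuration that ships ETL files to the central bucket is a small YAML file mounted as a secret. A minimal sketch, assuming the S3 backend; the bucket name and region are placeholders:

```yaml
# federated-store.yaml (sketch): each cluster mounts this as a secret so its
# ETL output lands in one central S3 bucket. Bucket and region are placeholders.
type: S3
config:
  bucket: "your-kubecost-etl-bucket"
  endpoint: "s3.amazonaws.com"
  region: "us-east-2"
```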
See the troubleshooting section of this article if you run into any errors while setting up the integration. For support from AWS, you can submit a support request through your existing AWS support contract.
You can add recording rules to improve performance. Recording rules let you precompute frequently needed or computationally expensive expressions and save the results as a new set of time series. Querying the precomputed result is often much faster than evaluating the original expression every time it is needed. Follow these AWS instructions to add the following rules:
{% code overflow="wrap" %}
```yaml
groups:
  - name: CPU
    rules:
      - expr: sum(rate(container_cpu_usage_seconds_total{container_name!=""}[5m]))
        record: cluster:cpu_usage:rate5m
      - expr: rate(container_cpu_usage_seconds_total{container_name!=""}[5m])
        record: cluster:cpu_usage_nosum:rate5m
      - expr: avg(irate(container_cpu_usage_seconds_total{container_name!="POD", container_name!=""}[5m])) by (container_name,pod_name,namespace)
        record: kubecost_container_cpu_usage_irate
      - expr: sum(container_memory_working_set_bytes{container_name!="POD",container_name!=""}) by (container_name,pod_name,namespace)
        record: kubecost_container_memory_working_set_bytes
      - expr: sum(container_memory_working_set_bytes{container_name!="POD",container_name!=""})
        record: kubecost_cluster_memory_working_set_bytes
  - name: Savings
    rules:
      - expr: sum(avg(kube_pod_owner{owner_kind!="DaemonSet"}) by (pod) * sum(container_cpu_allocation) by (pod))
        record: kubecost_savings_cpu_allocation
        labels:
          daemonset: "false"
      - expr: sum(avg(kube_pod_owner{owner_kind="DaemonSet"}) by (pod) * sum(container_cpu_allocation) by (pod)) / sum(kube_node_info)
        record: kubecost_savings_cpu_allocation
        labels:
          daemonset: "true"
      - expr: sum(avg(kube_pod_owner{owner_kind!="DaemonSet"}) by (pod) * sum(container_memory_allocation_bytes) by (pod))
        record: kubecost_savings_memory_allocation_bytes
        labels:
          daemonset: "false"
      - expr: sum(avg(kube_pod_owner{owner_kind="DaemonSet"}) by (pod) * sum(container_memory_allocation_bytes) by (pod)) / sum(kube_node_info)
        record: kubecost_savings_memory_allocation_bytes
        labels:
          daemonset: "true"
```
{% endcode %}
The `RunDiagnostic` logs in the `cost-model` container will contain the most useful information:

```sh
kubectl logs -n kubecost deployments/kubecost-cost-analyzer -c cost-model | grep RunDiagnostic
```
Test whether the Kubecost metrics are available using Grafana, or exec into the Kubecost frontend and run a cURL against the AMP endpoint.

Grafana query:

```promql
count({__name__=~".+"}) by (job)
```
Port-forward to `cost-model:9090`:

```sh
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
```

Then open the direct link: `http://localhost:9090`.
Or use an exec command:

```sh
kubectl exec -i -t \
  deployments/kubecost-cost-analyzer \
  -c cost-analyzer-frontend -- \
  curl -G "0:9090/model/prometheusQuery" \
  --data-urlencode "query=node_total_hourly_cost"
```
Failure (the query succeeds, but no data is returned):

```json
{"status":"success","data":{"resultType":"vector","result":[]}}
```
Success:

```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "node_total_hourly_cost",
          "arch": "amd64",
          "cluster_id": "eks-integration",
          "instance": "ip-172-31-9-41.us-east-2.compute.internal",
          "instance_type": "m6a.xlarge",
          "job": "kubecost-metrics",
          "node": "ip-172-31-9-41.us-east-2.compute.internal",
          "provider_id": "aws:///us-east-2a/i-0d844bf800d01bde1",
          "region": "us-east-2"
        },
        "value": [
          1709403009,
          "0.1728077431907654"
        ]
      }
    ]
  }
}
```
The queries below must return data for Kubecost to calculate costs correctly. For them to work, first set these environment variables:

```sh
KUBECOST_NAMESPACE=kubecost
KUBECOST_DEPLOYMENT=kubecost-cost-analyzer
CLUSTER_ID=YOUR_CLUSTER_NAME
```
- Verify the connection to AMP and that the metric `container_memory_working_set_bytes` is available:
If you have set `kubecostModel.promClusterIDLabel`, you will need to change the query (`CLUSTER_ID`) to match the label (typically `cluster` or `alpha_eksctl_io_cluster_name`); a values sketch follows the example output below.
```sh
kubectl exec -i -t -n $KUBECOST_NAMESPACE \
  deployments/$KUBECOST_DEPLOYMENT -c cost-analyzer-frontend \
  -- curl "0:9090/model/prometheusQuery?query=container_memory_working_set_bytes\{CLUSTER_ID=\"$CLUSTER_ID\"\}"
```
The output should contain a JSON entry similar to the following. The value of `cluster_id` should match the value of `kubecostProductConfigs.clusterName`.
```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "container_memory_working_set_bytes",
          "cluster_id": "qa-eks1",
          "alpha_eksctl_io_cluster_name": "qa-eks1",
          "alpha_eksctl_io_nodegroup_name": "qa-eks1-nodegroup",
          "beta_kubernetes_io_arch": "amd64",
          "beta_kubernetes_io_instance_type": "t3.medium",
          "beta_kubernetes_io_os": "linux",
          "eks_amazonaws_com_capacityType": "ON_DEMAND",
          "eks_amazonaws_com_nodegroup": "qa-eks1-nodegroup",
          "id": "/",
          "instance": "ip-10-10-8-66.us-east-2.compute.internal",
          "job": "kubernetes-nodes-cadvisor"
        },
        "value": [
          1697630036,
          "3043811328"
        ]
      }
    ]
  }
}
```
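If you overrode the cluster ID label as noted above, the corresponding Helm values look roughly like this. A minimal sketch; the label value is illustrative, so use whatever label your Prometheus relabeling actually applies:

```yaml
# values.yaml (sketch): make Kubecost match on the cluster label your
# Prometheus configuration attaches; "cluster" below is illustrative.
kubecostModel:
  promClusterIDLabel: cluster
```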
- Verify Kubecost metrics are available in AMP:
```sh
kubectl exec -i -t -n $KUBECOST_NAMESPACE \
  deployments/$KUBECOST_DEPLOYMENT -c cost-analyzer-frontend \
  -- curl "0:9090/model/prometheusQuery?query=node_total_hourly_cost\{CLUSTER_ID=\"$CLUSTER_ID\"\}" \
  | jq
```
The output should contain a JSON entry similar to:
```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "node_total_hourly_cost",
          "cluster_id": "qa-eks1",
          "alpha_eksctl_io_cluster_name": "qa-eks1",
          "arch": "amd64",
          "instance": "ip-192-168-47-226.us-east-2.compute.internal",
          "instance_type": "t3.medium",
          "job": "kubecost"
        },
        "value": [
          1697630306,
          "0.04160104542160034"
        ]
      }
    ]
  }
}
```
If the above queries fail, check the following:
- Check the logs of the `sigv4proxy` container (it may be in the Kubecost deployment or the Prometheus server deployment, depending on your setup):

```sh
kubectl logs deployments/$KUBECOST_DEPLOYMENT -c sigv4proxy --tail=-1
```
In a working `sigv4proxy`, there will be very few logs. Correctly working log output:

```
time="2023-09-21T17:40:15Z" level=info msg="Stripping headers []" StripHeaders="[]"
time="2023-09-21T17:40:15Z" level=info msg="Listening on :8005" port=":8005"
```
- Check the logs in the `cost-model` container for Prometheus connection issues:

```sh
kubectl logs deployments/$KUBECOST_DEPLOYMENT -c cost-model --tail=-1 | grep -i err
```
Example errors:

```
ERR CostModel.ComputeAllocation: pod query 1 try 2 failed: avg(kube_pod_container_status_running...
Prometheus communication error: 502 (Bad Gateway) ...
```