# Amazon EKS cluster metrics This example demonstrates how to monitor your Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the Observability Accelerator's [EKS monitoring module](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring). Monitoring Amazon Elastic Kubernetes Service (Amazon EKS) for metrics has two categories: the control plane and the Amazon EKS nodes (with Kubernetes objects). The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software, such as etcd and the Kubernetes API server. To read more on the components of an Amazon EKS cluster, please read the [service documentation](https://docs.aws.amazon.com/eks/latest/userguide/clusters.html). The Amazon EKS infrastructure Terraform modules focuses on metrics collection to Amazon Managed Service for Prometheus using the [AWS Distro for OpenTelemetry Operator](https://docs.aws.amazon.com/eks/latest/userguide/opentelemetry.html) for Amazon EKS. It deploys the [node exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) in your cluster. It provides default dashboards to get a comprehensible visibility on your nodes, namespaces, pods, and Kubelet operations health. Finally, you get curated Prometheus recording rules and alerts to operate your cluster. Additionally, you can optionally collect custom Prometheus metrics from your applications running on your EKS cluster. ## Prerequisites !!! note Make sure to complete the [prerequisites section](https://aws-observability.github.io/terraform-aws-observability-accelerator/concepts/#prerequisites) before proceeding. ## Setup #### 1. Download sources and initialize Terraform ``` git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git cd examples/existing-cluster-with-base-and-infra terraform init ``` #### 2. AWS Region Specify the AWS Region where the resources will be deployed: ```bash export TF_VAR_aws_region=xxx ``` #### 3. Amazon EKS Cluster To run this example, you need to provide your EKS cluster name. If you don't have a cluster ready, visit [this example](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/new-eks-cluster/) first to create a new one. Specify your cluster name: ```bash export TF_VAR_eks_cluster_id=xxx ``` #### 4. Amazon Managed Service for Prometheus workspace (optional) By default, we create an Amazon Managed Service for Prometheus workspace for you. However, if you have an existing workspace you want to reuse, edit and run: ```bash export TF_VAR_managed_prometheus_workspace_id=ws-xxx ``` To create a workspace outside of Terraform's state, simply run: ```bash aws amp create-workspace --alias observability-accelerator --query '.workspaceId' --output text ``` #### 5. Amazon Managed Grafana workspace To visualize metrics collected, you need an Amazon Managed Grafana workspace. If you have an existing workspace, create an environment variable as described below. To create a new workspace, visit [our supporting example for Grafana](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/managed-grafana/) !!! note For the URL `https://g-xyz.grafana-workspace.eu-central-1.amazonaws.com`, the workspace ID would be `g-xyz` ```bash export TF_VAR_managed_grafana_workspace_id=g-xxx ``` #### 6. Grafana API Key Amazon Managed Grafana provides a control plane API for generating Grafana API keys. As a security best practice, we will provide to Terraform a short lived API key to run the `apply` or `destroy` command. Ensure you have necessary IAM permissions (`CreateWorkspaceApiKey, DeleteWorkspaceApiKey`) !!! note Starting version v2.5.x and above, we use Grafana Operator and External Secrets to manage Grafana contents. Your API Key will be stored securely on AWS SSM Parameter Store and the Grafana Operator will use it to sync dashboards, folders and data sources. Read more [here](https://aws-observability.github.io/terraform-aws-observability-accelerator/concepts/). ```bash export TF_VAR_grafana_api_key=`aws grafana create-workspace-api-key --key-name "observability-accelerator-$(date +%s)" --key-role ADMIN --seconds-to-live 7200 --workspace-id $TF_VAR_managed_grafana_workspace_id --query key --output text` ``` ## Deploy Simply run this command to deploy the example ```bash terraform apply ``` ## Visualization #### 1. Grafana dashboards Login to your Grafana workspace and navigate to the Dashboards panel. You should see a list of dashboards under the `Observability Accelerator Dashboards` image Open a specific dashboard and you should be able to view its visualization cluster headlines With v2.5 and above, the dashboards are managed with a Grafana Operator running in your cluster. From the cluster to view all dashboards as Kubernetes objects, run ```console kubectl get grafanadashboards -A NAMESPACE NAME AGE grafana-operator cluster-grafanadashboard 138m grafana-operator java-grafanadashboard 143m grafana-operator kubelet-grafanadashboard 13h grafana-operator namespace-workloads-grafanadashboard 13h grafana-operator nginx-grafanadashboard 134m grafana-operator node-exporter-grafanadashboard 13h grafana-operator nodes-grafanadashboard 13h grafana-operator workloads-grafanadashboard 13h ``` You can inspect more details per dashboard using this command ```console kubectl describe grafanadashboards cluster-grafanadashboard -n grafana-operator ``` Grafana Operator and Flux always work together to synchronize your dashboards with Git. If you delete your dashboards by accident, they will be re-provisioned automatically. #### 3. Amazon Managed Service for Prometheus rules and alerts Open the Amazon Managed Service for Prometheus console and view the details of your workspace. Under the `Rules management` tab, you should find new rules deployed. image !!! note To setup your alert receiver, with Amazon SNS, follow [this documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver.html) ## Custom Prometheus metrics collection In addition to the cluster metrics, if you are interested in collecting Prometheus metrics from your pods, you can use setup `custom metrics collection`. This will instruct the ADOT collector to scrape your applications metrics based on the configuration you provide. You can also exclude some of the metrics and save costs. Using the example, you can edit `examples/existing-cluster-with-base-and-infra/main.tf`. In the module `module "workloads_infra" {` add the following config (make sure the values matches your use case): ```hcl enable_custom_metrics = true custom_metrics_config = { # list of applications ports (example) ports = [8000, 8080] # list of series prefixes you want to discard from ingestion dropped_series_prefixes = ["go_gcc"] } ``` After applying Terraform, on Grafana, you can query Prometheus for your application metrics, create alerts and build on your own dashboards. On the explorer section of Grafana, the following query will give you the containers exposing metrics that matched the custom metrics collection, grouped by cluster and node. ```promql sum(up{job="custom-metrics"}) by (container_name, cluster, nodename) ``` Screenshot 2023-01-31 at 11 16 21 ## Troubleshooting ### 1. Grafana dashboards missing or Grafana API key expired In case you don't see the grafana dashboards in your Amazon Managed Grafana console, check on the logs on your grafana operator pod using the below command : ```bash kubectl get pods -n grafana-operator ``` Output: ```console NAME READY STATUS RESTARTS AGE grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m ``` ```bash kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator ``` Output: ```console 1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"} github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile ``` If you observe, the the above `grafana-api-key error` in the logs, your grafana API key is expired. Please use the operational procedure to update your `grafana-api-key` : - First, lets create a new Grafana API key. ```bash export GO_AMG_API_KEY=$(aws grafana create-workspace-api-key \ --key-name "grafana-operator-key-new" \ --key-role "ADMIN" \ --seconds-to-live 432000 \ --workspace-id \ --query key \ --output text) ``` - Finally, update the Grafana API key secret in AWS SSM Parameter Store using the above new Grafana API key: ```bash aws aws ssm put-parameter \ --name "/terraform-accelerator/grafana-api-key" \ --type "SecureString" \ --value "{\"GF_SECURITY_ADMIN_APIKEY\": \"${GO_AMG_API_KEY}\"}" \ --region ``` - If the issue persists, you can force the synchronization by deleting the `externalsecret` Kubernetes object. ```bash kubectl delete externalsecret/external-secrets-sm -n grafana-operator ``` ### 2. Upgrade from 2.1.0 or earlier When you upgrade the eks-monitoring module from v2.1.0 or earlier, the following error may occur. ```bash Error: cannot patch "prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "prometheus-node-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"prometheus-node-exporter", "app.kubernetes.io/name":"prometheus-node-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable ``` This is due to the upgrade of the node-exporter chart from v2 to v4. Manually delete the node-exporter's DaemonSet as described in [the link here](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter#3x-to-4x), and then apply. ```bash kubectl -n prometheus-node-exporter delete daemonset -l app=prometheus-node-exporter terraform apply ```