# Amazon EKS cluster metrics
This example demonstrates how to monitor your Amazon Elastic Kubernetes Service
(Amazon EKS) cluster with the Observability Accelerator's
[EKS monitoring module](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring).
Metrics monitoring for Amazon EKS falls into two categories:
the control plane and the Amazon EKS nodes (with Kubernetes objects).
The Amazon EKS control plane consists of control plane nodes that run the Kubernetes software,
such as etcd and the Kubernetes API server. To read more on the components of an Amazon EKS cluster,
please read the [service documentation](https://docs.aws.amazon.com/eks/latest/userguide/clusters.html).
The Amazon EKS infrastructure Terraform module focuses on collecting metrics into Amazon
Managed Service for Prometheus using the [AWS Distro for OpenTelemetry Operator](https://docs.aws.amazon.com/eks/latest/userguide/opentelemetry.html) for Amazon EKS. It deploys the [node exporter](https://github.com/prometheus/node_exporter) and [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) in your cluster.
It provides default dashboards that give you comprehensive visibility into your nodes,
namespaces, pods, and kubelet operations health. Finally, you get curated Prometheus recording rules
and alerts to operate your cluster.
You can also collect custom Prometheus metrics from the applications running
on your EKS cluster.
## Prerequisites
!!! note
Make sure to complete the [prerequisites section](https://aws-observability.github.io/terraform-aws-observability-accelerator/concepts/#prerequisites) before proceeding.
## Setup
#### 1. Download sources and initialize Terraform
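```bash
git clone https://github.com/aws-observability/terraform-aws-observability-accelerator.git
# git clone creates the repository directory, so enter it before the example folder
cd terraform-aws-observability-accelerator/examples/existing-cluster-with-base-and-infra
terraform init
```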
#### 2. AWS Region
Specify the AWS Region where the resources will be deployed:
```bash
export TF_VAR_aws_region=xxx
```
#### 3. Amazon EKS Cluster
To run this example, you need to provide your EKS cluster name. If you don't
have a cluster ready, visit [this example](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/new-eks-cluster/)
first to create a new one.
Specify your cluster name:
```bash
export TF_VAR_eks_cluster_id=xxx
```
#### 4. Amazon Managed Service for Prometheus workspace (optional)
By default, we create an Amazon Managed Service for Prometheus workspace for you.
However, if you have an existing workspace you want to reuse, edit and run:
```bash
export TF_VAR_managed_prometheus_workspace_id=ws-xxx
```
To create a workspace outside of Terraform's state, run:
```bash
# --query takes a JMESPath expression, which has no leading dot
aws amp create-workspace --alias observability-accelerator --query 'workspaceId' --output text
```
#### 5. Amazon Managed Grafana workspace
To visualize metrics collected, you need an Amazon Managed Grafana workspace. If you have
an existing workspace, create an environment variable as described below.
To create a new workspace, visit [our supporting example for Grafana](https://aws-observability.github.io/terraform-aws-observability-accelerator/helpers/managed-grafana/).
!!! note
For the URL `https://g-xyz.grafana-workspace.eu-central-1.amazonaws.com`, the workspace ID would be `g-xyz`
```bash
export TF_VAR_managed_grafana_workspace_id=g-xxx
```
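If you only have the workspace URL at hand, the ID can be derived from it with plain shell string handling. The sketch below is not part of the accelerator: `grafana_workspace_id_from_url` is a hypothetical helper name, and it assumes the URL follows the `https://<workspace-id>.grafana-workspace.<region>.amazonaws.com` form shown in the note above.

```bash
# Hypothetical helper (not part of the accelerator): derive the workspace ID
# from an Amazon Managed Grafana URL. Assumes the ID is the first DNS label.
grafana_workspace_id_from_url() {
  local url="$1"
  local host="${url#*://}"     # strip the scheme, e.g. https://
  host="${host%%/*}"           # strip any path after the host name
  printf '%s\n' "${host%%.*}"  # keep the first DNS label, e.g. g-xyz
}

# Example (commented out so that sourcing this snippet has no side effects):
# export TF_VAR_managed_grafana_workspace_id="$(grafana_workspace_id_from_url "https://g-xyz.grafana-workspace.eu-central-1.amazonaws.com")"
```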
#### 6. Grafana API Key
Amazon Managed Grafana provides a control plane API for generating Grafana API keys.
As a security best practice, we provide Terraform with a short-lived API key to
run the `apply` or `destroy` command.
Ensure you have the necessary IAM permissions (`CreateWorkspaceApiKey`, `DeleteWorkspaceApiKey`).
!!! note
Starting with v2.5.x, we use Grafana Operator and External Secrets to
manage Grafana contents. Your API key is stored securely in AWS SSM Parameter Store,
and the Grafana Operator uses it to sync dashboards, folders, and data sources.
Read more [here](https://aws-observability.github.io/terraform-aws-observability-accelerator/concepts/).
```bash
export TF_VAR_grafana_api_key=`aws grafana create-workspace-api-key --key-name "observability-accelerator-$(date +%s)" --key-role ADMIN --seconds-to-live 7200 --workspace-id $TF_VAR_managed_grafana_workspace_id --query key --output text`
```
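Before deploying, it can help to confirm that the variables exported in the steps above are set in the current shell. The following is a minimal sketch; `check_tf_vars` is a hypothetical helper name and not part of the accelerator. The Amazon Managed Service for Prometheus workspace ID is left out on purpose because it is optional.

```bash
# Hypothetical pre-flight check (not part of the accelerator): report every
# required variable from this guide that is unset or empty.
check_tf_vars() {
  local missing=0 name
  for name in TF_VAR_aws_region TF_VAR_eks_cluster_id \
              TF_VAR_managed_grafana_workspace_id TF_VAR_grafana_api_key; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"   # 0 when everything is set, 1 otherwise
}

# Example (commented out so that sourcing this snippet has no side effects):
# check_tf_vars && terraform apply
```

The function only checks that the values are non-empty. It does not verify that the cluster or workspace exists, or that the API key is still valid.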
## Deploy
Run this command to deploy the example:
```bash
terraform apply
```
## Visualization
#### 1. Grafana dashboards
Log in to your Grafana workspace and navigate to the Dashboards panel. You should see a list of dashboards under `Observability Accelerator Dashboards`.
Open a specific dashboard to view its visualizations.
With v2.5 and above, the dashboards are managed by a Grafana Operator running in your cluster.
To view all dashboards as Kubernetes objects, run the following from your cluster:
```console
kubectl get grafanadashboards -A
NAMESPACE NAME AGE
grafana-operator cluster-grafanadashboard 138m
grafana-operator java-grafanadashboard 143m
grafana-operator kubelet-grafanadashboard 13h
grafana-operator namespace-workloads-grafanadashboard 13h
grafana-operator nginx-grafanadashboard 134m
grafana-operator node-exporter-grafanadashboard 13h
grafana-operator nodes-grafanadashboard 13h
grafana-operator workloads-grafanadashboard 13h
```
You can inspect more details per dashboard using this command:
```console
kubectl describe grafanadashboards cluster-grafanadashboard -n grafana-operator
```
Grafana Operator and Flux always work together to synchronize your dashboards with Git.
If you delete your dashboards by accident, they will be re-provisioned automatically.
#### 2. Amazon Managed Service for Prometheus rules and alerts
Open the Amazon Managed Service for Prometheus console and view the details of your workspace. Under the `Rules management` tab, you should find new rules deployed.
!!! note
To set up your alert receiver with Amazon SNS, follow [this documentation](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver.html).
## Custom Prometheus metrics collection
In addition to the cluster metrics, if you want to collect Prometheus
metrics from your pods, you can set up custom metrics collection.
This instructs the ADOT collector to scrape your application metrics based
on the configuration you provide. You can also exclude some of the metrics to save costs.
To use it with this example, edit `examples/existing-cluster-with-base-and-infra/main.tf`.
In the `module "workloads_infra" {` block, add the following configuration (make sure the values match your use case):
```hcl
enable_custom_metrics = true
custom_metrics_config = {
# list of applications ports (example)
ports = [8000, 8080]
# list of series prefixes you want to discard from ingestion
dropped_series_prefixes = ["go_gcc"]
}
```
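To get a feel for which series a prefix list would discard, you can run the same kind of prefix match over a list of metric names. This is an illustration only: `drop_by_prefix` is a hypothetical helper, the metric names in the example are made up, and the sketch does not reproduce how the ADOT collector applies the setting.

```bash
# Illustration only (hypothetical helper): read metric names from stdin and
# print those that do NOT start with any of the given prefixes.
# Usage: drop_by_prefix PREFIX... < metric_names.txt
drop_by_prefix() {
  local name prefix keep
  while IFS= read -r name; do
    keep=1
    for prefix in "$@"; do
      case "$name" in
        "$prefix"*) keep=0 ;;   # name starts with a dropped prefix
      esac
    done
    if [ "$keep" -eq 1 ]; then
      printf '%s\n' "$name"
    fi
  done
  return 0
}

# Example with made-up metric names (commented out to avoid side effects):
# printf '%s\n' go_gcc_example_total http_requests_total | drop_by_prefix go_gcc
```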
After applying Terraform, you can query Prometheus from Grafana for your application metrics,
create alerts, and build your own dashboards. In the Explore section of Grafana, the
following query returns the containers exposing metrics that matched the custom metrics
collection, grouped by container, cluster, and node.
```promql
sum(up{job="custom-metrics"}) by (container_name, cluster, nodename)
```
## Troubleshooting
### 1. Grafana dashboards missing or Grafana API key expired
If you don't see the Grafana dashboards in your Amazon Managed Grafana console, check the logs of your Grafana Operator pod. First, find the pod name, then read its logs:
```bash
kubectl get pods -n grafana-operator
```
Output:
```console
NAME READY STATUS RESTARTS AGE
grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m
```
```bash
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
```
Output:
```console
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
```
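On a busy operator the relevant line can be hard to spot, so a small filter over the log stream may help. The sketch below is a hypothetical helper, not part of the accelerator. It matches the `Expired API key` message from the sample output above, and the commented example assumes the operator runs as a Deployment named `grafana-operator`, which is inferred from the pod name rather than stated in this guide.

```bash
# Hypothetical helper (not part of the accelerator): succeed if the log text
# on stdin contains the expired-key error shown in the sample output.
has_expired_api_key() {
  grep -q 'Expired API key'
}

# Example (commented out; the Deployment name is an assumption):
# kubectl logs deployment/grafana-operator -n grafana-operator | has_expired_api_key \
#   && echo "Grafana API key has expired"
```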
If you see the `Expired API key` error shown above in the logs, your Grafana API key has expired. Use the following procedure to update your `grafana-api-key`:
- First, let's create a new Grafana API key.
```bash
export GO_AMG_API_KEY=$(aws grafana create-workspace-api-key \
--key-name "grafana-operator-key-new" \
--key-role "ADMIN" \
--seconds-to-live 432000 \
--workspace-id "$TF_VAR_managed_grafana_workspace_id" \
--query key \
--output text)
```
- Next, update the Grafana API key secret in AWS SSM Parameter Store with the new key:
```bash
# --overwrite is required because the parameter already exists.
# Assumes TF_VAR_aws_region from the Setup section is still exported.
aws ssm put-parameter \
--name "/terraform-accelerator/grafana-api-key" \
--type "SecureString" \
--value "{\"GF_SECURITY_ADMIN_APIKEY\": \"${GO_AMG_API_KEY}\"}" \
--overwrite \
--region "$TF_VAR_aws_region"
```
- If the issue persists, you can force the synchronization by deleting the `externalsecret` Kubernetes object.
```bash
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
```
### 2. Upgrade from 2.1.0 or earlier
When you upgrade the eks-monitoring module from v2.1.0 or earlier, the following error may occur.
```console
Error: cannot patch "prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "prometheus-node-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"prometheus-node-exporter", "app.kubernetes.io/name":"prometheus-node-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
```
This is due to the upgrade of the node-exporter chart from v2 to v4. Manually delete the node-exporter's DaemonSet as described in [the link here](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter#3x-to-4x), and then apply.
```bash
kubectl -n prometheus-node-exporter delete daemonset -l app=prometheus-node-exporter
terraform apply
```