# EKS Autoscaling with Karpenter

## What is Karpenter?

[Karpenter](https://karpenter.sh) is an open-source Kubernetes controller that provides node autoscaling.

## How is Karpenter different from the traditional Cluster Autoscaler?

The main difference is that Karpenter does not rely on node groups, as Cluster Autoscaler does, to expand or contract the cluster. It is claimed that Karpenter is both more performant and more cost-effective than Cluster Autoscaler. More details can be found in [this article](https://towardsdev.com/karpenter-vs-cluster-autoscaler-dd877b91629b).

## How to set up Karpenter on EKS

The scripts that automate deployment of Karpenter in this project were created following this [walkthrough](https://karpenter.sh/v0.10.0/getting-started/getting-started-with-eksctl/).

1. Build and run the aws-do-eks project, then exec into the aws-do-eks container.
2. Edit eks-karpenter.yaml as needed
3. Edit `eks.conf`
   3.1. set `CONFIG=yaml`
   3.2. set `EKS_YAML=./eks-karpenter.yaml`
4. Create the cluster by executing `./eks-create.sh`
5. Refer to the scripts in the /eks/deployment/karpenter directory
   5.1. Execute `./deploy.sh` to configure the necessary roles and deploy the controller
   5.2. Optionally execute `./monitoring-deploy.sh` to enable Prometheus/Grafana monitoring of Karpenter metrics
   5.3. Execute `./provisioner-deploy.sh` to configure Karpenter for the cluster
6. Test Karpenter
   6.1. Execute `./test-scale-out.sh`
   6.2. Use the `./logs.sh` script to monitor Karpenter operations
   6.3. Use `./scale.sh` to change the test scale
   6.4. Execute `./test-scale-in.sh` to stop testing

## How to use Karpenter with EFA

[Certain Amazon EC2 instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) are enabled with a high-performance network adapter known as the [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/).

As of version 0.16.3, Karpenter is not able to automatically provision instances when EFA resources are requested. To enable autoscaling of EFA-enabled instances with Karpenter, we can configure a provisioner with a custom launch template that enables EFA. When the provisioner is limited to only the desired EFA-enabled instance types, Karpenter will automatically create instances with EFA adapters to satisfy known resource requests such as GPUs.

In order to use the EFA device in a pod, the path `/dev/infiniband` needs to be mounted as a `hostPath` volume, which requires the pod to run in privileged mode (see the pod spec sketch below). To remove this requirement, Karpenter would need to be modified to recognize instance types that support EFA.

An advantage of configuring a traditional cluster with Karpenter is that it enables manual and automatic scaling at the same time. We can manually set autoscaling groups to the desired minimum instance counts, and rely on Karpenter to add nodes to the cluster only when additional capacity is needed.

An example of a custom launch template is provided in file [./launch-template-efa-example.json](./launch-template-efa-example.json). An example provisioner configuration is provided in file [./provisioner-efa.yaml](./provisioner-efa.yaml).
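To make the provisioner approach concrete, here is a minimal sketch in the spirit of [./provisioner-efa.yaml](./provisioner-efa.yaml), using the Karpenter `v1alpha5` Provisioner API; the instance type, launch template name, and subnet discovery tag below are placeholders and should be adjusted to your cluster and Karpenter version:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: efa
spec:
  # Limit the provisioner to the desired EFA-capable instance types (placeholder value)
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["p4d.24xlarge"]
  provider:
    # Custom launch template that enables EFA (placeholder name)
    launchTemplate: lt-karpenter-efa
    # Subnets where the EFA-enabled instances should be launched (placeholder tag)
    subnetSelector:
      karpenter.sh/discovery: my-cluster
  # Remove empty nodes after they have been idle for 5 minutes
  ttlSecondsAfterEmpty: 300
```

With a provisioner like this in place, pending pods that can only be scheduled on the listed instance types cause Karpenter to launch EFA-enabled nodes from the referenced launch template.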
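Similarly, a hypothetical pod spec fragment illustrating the `/dev/infiniband` `hostPath` mount and privileged mode described above could look roughly like this (the container name and image are placeholders):

```yaml
# Hypothetical fragment of a pod spec that exposes the EFA device via a hostPath mount
spec:
  containers:
    - name: nccl-test                              # placeholder name
      image: <your-cuda-efa-nccl-tests-image>      # placeholder image
      securityContext:
        privileged: true                           # required for the hostPath device mount
      volumeMounts:
        - name: infiniband
          mountPath: /dev/infiniband
  volumes:
    - name: infiniband
      hostPath:
        path: /dev/infiniband
```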
A recommended process for setting up Karpenter with EFA is described below:

1. Build and run the aws-do-eks project, then exec into the aws-do-eks container.
2. Edit eks-karpenter.yaml
   2.1. Configure a system node group where the Karpenter pods will run
   2.2. Add node groups with the desired instance types that have EFA capabilities
3. Edit `eks.conf`
   3.1. set `CONFIG=yaml`
   3.2. set `EKS_YAML=./eks-karpenter.yaml`
4. Create the cluster by executing `./eks-create.sh` within the `aws-do-eks` container
5. Refer to the scripts in the /eks/deployment/karpenter directory
   5.1. Execute `./migrate-auth.sh` and copy the authorization group to your clipboard. Example:

```yaml
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterInstanceNodeRole
  username: system:node:{{EC2PrivateDNSName}}
```

   5.2. Execute `./migrate.sh` to migrate the cluster from Cluster Autoscaler and deploy Karpenter. When asked to edit the aws-auth configmap, paste the content you copied to your clipboard.
   5.3. Modify the launch templates of your node groups. The modified launch template must include the following fields, which are required by Karpenter:
      * AMI
      * KeyPairName
      * NetworkSettings -> Advanced Network Configuration -> Elastic Fabric Adapter -> Enable
      * IAM instance profile: KarpenterInstanceProfile
      * Storage -> DiskSize: 200GB

   Set the default version to be the new template revision. Check that the UserData in the template includes the kubelet bootstrap script and, if EFA is not already installed in the AMI, the EFA driver installation script.
   5.4. Edit and apply file `./provisioner-efa.yaml`. Set the instance types and launch template name as needed. If you need to use multiple launch templates, you will need to configure multiple provisioners.

```bash
kubectl apply -f ./provisioner-efa.yaml
```

6. Test Karpenter
   6.1. Execute `./test-scale-out.sh`; you should see EFA-enabled instances get added to the cluster
   6.2. Execute `./test-scale-in.sh`; the new instances should be removed from the cluster and terminated
7. Test EFA
   7.1. Deploy the Kubeflow MPI Operator

```bash
cd /eks/deployment/kubeflow/mpi-operator; ./deploy.sh
```

   7.2. Build the `cuda-efa-nccl-tests` container and push it to ECR

```bash
cd /eks/deployment/efa-device-plugin/cuda-efa-nccl-tests
./build.sh
./push.sh
```

   7.3. Edit /eks/deployment/efa-device-plugin/test-nccl.yaml and test-nccl-efa.yaml (a consolidated sketch of the resulting resources section appears after this procedure)
      7.3.1. Replace the image with the image you built and pushed to ECR
      7.3.2. Comment out unsupported resource requests and limits

```yaml
#hugepages-2Mi: 5120Mi
#vpc.amazonaws.com/efa: 1
```

      7.3.3. Set the desired number of GPUs in the requests and limits

```yaml
nvidia.com/gpu: 1
```

   7.4. Execute tests
      7.4.1. Execute the test with EFA disabled

```bash
cd /eks/deployment/efa-device-plugin
kubectl delete mpijob --all
kubectl apply -f ./test-nccl.yaml
```

The worker pods will be in status `Pending` and the launcher pod will be in status `CrashLoopBackOff` until Karpenter adds new nodes to the cluster and the nodes become `Ready`; then the worker and launcher pods will enter the `Running` state. When the launcher pod is in the `Running` or `Completed` state, extract the pod logs to review the test results.

```bash
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```

Sample output:

```log
...
[1,0]:test-nccl-worker-0:22:22 [0] NCCL INFO Using network Socket
...
[1,0]:#                                                            out-of-place                        in-place
[1,0]:#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
[1,0]:#        (B)    (elements)                                (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
[1,0]:           0             0     float     sum      -1     3.91    0.00    0.00       0     3.88    0.00    0.00       0
[1,0]:           0             0     float     sum      -1     3.91    0.00    0.00       0     3.91    0.00    0.00       0
[1,0]:           4             1     float     sum      -1    98.77    0.00    0.00       0    99.81    0.00    0.00       0
[1,0]:           8             2     float     sum      -1    101.4    0.00    0.00       0    101.8    0.00    0.00       0
[1,0]:          16             4     float     sum      -1    105.3    0.00    0.00       0    103.9    0.00    0.00       0
[1,0]:          32             8     float     sum      -1    106.3    0.00    0.00       0    106.7    0.00    0.00       0
[1,0]:          64            16     float     sum      -1    107.9    0.00    0.00       0    108.8    0.00    0.00       0
[1,0]:         128            32     float     sum      -1    109.5    0.00    0.00       0    109.4    0.00    0.00       0
[1,0]:         256            64     float     sum      -1    110.5    0.00    0.00       0    109.1    0.00    0.00       0
[1,0]:         512           128     float     sum      -1    112.9    0.00    0.00       0    115.3    0.00    0.00       0
[1,0]:        1024           256     float     sum      -1    109.0    0.01    0.01       0    109.0    0.01    0.01       0
[1,0]:        2048           512     float     sum      -1    109.5    0.02    0.02       0    113.1    0.02    0.02       0
[1,0]:        4096          1024     float     sum      -1    106.5    0.04    0.04       0    106.0    0.04    0.04       0
[1,0]:        8192          2048     float     sum      -1    111.6    0.07    0.07       0    111.6    0.07    0.07       0
[1,0]:       16384          4096     float     sum      -1    155.8    0.11    0.11       0    152.7    0.11    0.11       0
[1,0]:       32768          8192     float     sum      -1    194.3    0.17    0.17       0    191.1    0.17    0.17       0
[1,0]:       65536         16384     float     sum      -1    262.7    0.25    0.25       0    259.6    0.25    0.25       0
[1,0]:      131072         32768     float     sum      -1    290.4    0.45    0.45       0    291.7    0.45    0.45       0
[1,0]:      262144         65536     float     sum      -1    383.0    0.68    0.68       0    387.5    0.68    0.68       0
[1,0]:      524288        131072     float     sum      -1    542.4    0.97    0.97       0    537.7    0.98    0.98       0
[1,0]:     1048576        262144     float     sum      -1    806.7    1.30    1.30       0    804.9    1.30    1.30       0
[1,0]:     2097152        524288     float     sum      -1   1384.5    1.51    1.51       0   1359.4    1.54    1.54       0
[1,0]:     4194304       1048576     float     sum      -1   2440.7    1.72    1.72       0   2400.4    1.75    1.75       0
[1,0]:     8388608       2097152     float     sum      -1   4656.9    1.80    1.80       0   4722.8    1.78    1.78       0
[1,0]:    16777216       4194304     float     sum      -1   9082.5    1.85    1.85       0   8985.6    1.87    1.87       0
[1,0]:    33554432       8388608     float     sum      -1    17301    1.94    1.94       0    17754    1.89    1.89       0
[1,0]:    67108864      16777216     float     sum      -1    35629    1.88    1.88       0    35327    1.90    1.90       0
[1,0]:   134217728      33554432     float     sum      -1    67134    2.00    2.00       0    70910    1.89    1.89       0
[1,0]:   268435456      67108864     float     sum      -1   133878    2.01    2.01       0   134353    2.00    2.00       0
[1,0]:   536870912     134217728     float     sum      -1   263971    2.03    2.03       0   267808    2.00    2.00       0
[1,0]:  1073741824     268435456     float     sum      -1   536774    2.00    2.00       0   541545    1.98    1.98       0
[1,0]:# Out of bounds values : 0 OK
[1,0]:# Avg bus bandwidth    : 0.733914
```

      7.4.2. Execute the test with EFA enabled. This test assumes that the EFA device plugin has been deployed to the cluster. If not, execute `cd /eks/deployment/efa-device-plugin; ./deploy.sh`.

```bash
cd /eks/deployment/efa-device-plugin
kubectl delete mpijob --all
kubectl apply -f ./test-nccl-efa-mount.yaml
```

The worker pods will be in status `Pending` and the launcher pod will be in status `CrashLoopBackOff` until Karpenter adds new nodes to the cluster and the nodes become `Ready`; then the worker and launcher pods will enter the `Running` state. When the launcher pod is in the `Running` or `Completed` state, extract the pod logs to review the test results.

```bash
kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
```

Sample output:

```log
...
[1,1]:test-nccl-efa-worker-1:22:22 [0] NCCL INFO NET/OFI Selected Provider is efa
[1,1]:test-nccl-efa-worker-1:22:22 [0] NCCL INFO Using network AWS Libfabric
...
[1,0]:#                                                            out-of-place                        in-place
[1,0]:#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
[1,0]:#        (B)    (elements)                                (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
[1,0]:           0             0     float     sum      -1     6.28    0.00    0.00       0     6.35    0.00    0.00       0
[1,0]:           0             0     float     sum      -1     6.28    0.00    0.00       0     6.28    0.00    0.00       0
[1,0]:           4             1     float     sum      -1    70.19    0.00    0.00       0    69.70    0.00    0.00       0
[1,0]:           8             2     float     sum      -1    70.54    0.00    0.00       0    70.23    0.00    0.00       0
[1,0]:          16             4     float     sum      -1    70.81    0.00    0.00       0    70.82    0.00    0.00       0
[1,0]:          32             8     float     sum      -1    69.52    0.00    0.00       0    69.84    0.00    0.00       0
[1,0]:          64            16     float     sum      -1    70.07    0.00    0.00       0    70.48    0.00    0.00       0
[1,0]:         128            32     float     sum      -1    70.93    0.00    0.00       0    70.21    0.00    0.00       0
[1,0]:         256            64     float     sum      -1    70.64    0.00    0.00       0    71.08    0.00    0.00       0
[1,0]:         512           128     float     sum      -1    70.94    0.01    0.01       0    70.24    0.01    0.01       0
[1,0]:        1024           256     float     sum      -1    72.98    0.01    0.01       0    72.76    0.01    0.01       0
[1,0]:        2048           512     float     sum      -1    73.31    0.03    0.03       0    73.65    0.03    0.03       0
[1,0]:        4096          1024     float     sum      -1    75.97    0.05    0.05       0    75.54    0.05    0.05       0
[1,0]:        8192          2048     float     sum      -1    84.36    0.10    0.10       0    85.09    0.10    0.10       0
[1,0]:       16384          4096     float     sum      -1    91.27    0.18    0.18       0    90.67    0.18    0.18       0
[1,0]:       32768          8192     float     sum      -1    103.2    0.32    0.32       0    101.8    0.32    0.32       0
[1,0]:       65536         16384     float     sum      -1    127.9    0.51    0.51       0    126.6    0.52    0.52       0
[1,0]:      131072         32768     float     sum      -1    185.4    0.71    0.71       0    184.6    0.71    0.71       0
[1,0]:      262144         65536     float     sum      -1    363.2    0.72    0.72       0    362.4    0.72    0.72       0
[1,0]:      524288        131072     float     sum      -1    719.2    0.73    0.73       0    714.4    0.73    0.73       0
[1,0]:     1048576        262144     float     sum      -1   1433.8    0.73    0.73       0   1421.9    0.74    0.74       0
[1,0]:     2097152        524288     float     sum      -1   1099.4    1.91    1.91       0   1088.7    1.93    1.93       0
[1,0]:     4194304       1048576     float     sum      -1   1933.6    2.17    2.17       0   1897.6    2.21    2.21       0
[1,0]:     8388608       2097152     float     sum      -1   3500.0    2.40    2.40       0   3481.5    2.41    2.41       0
[1,0]:    16777216       4194304     float     sum      -1   6632.0    2.53    2.53       0   6800.9    2.47    2.47       0
[1,0]:    33554432       8388608     float     sum      -1    12997    2.58    2.58       0    12751    2.63    2.63       0
[1,0]:    67108864      16777216     float     sum      -1    24521    2.74    2.74       0    24332    2.76    2.76       0
[1,0]:   134217728      33554432     float     sum      -1    48033    2.79    2.79       0    47884    2.80    2.80       0
[1,0]:   268435456      67108864     float     sum      -1    94498    2.84    2.84       0    94708    2.83    2.83       0
[1,0]:   536870912     134217728     float     sum      -1   189806    2.83    2.83       0   189397    2.83    2.83       0
[1,0]:  1073741824     268435456     float     sum      -1   378919    2.83    2.83       0   381119    2.82    2.82       0
[1,0]:# Out of bounds values : 0 OK
[1,0]:# Avg bus bandwidth    : 0.960456
```
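For orientation, the edits from step 7.3 typically end up in the worker container of the test manifests looking roughly like the following hypothetical fragment (the container name and image URI are placeholders, and the exact layout of test-nccl.yaml may differ):

```yaml
# Hypothetical worker container fragment after the edits in step 7.3
containers:
  - name: nccl-worker                                                        # placeholder name
    image: <account>.dkr.ecr.<region>.amazonaws.com/cuda-efa-nccl-tests      # placeholder image
    resources:
      requests:
        nvidia.com/gpu: 1
        #vpc.amazonaws.com/efa: 1    # commented out when the EFA resource is not schedulable
        #hugepages-2Mi: 5120Mi
      limits:
        nvidia.com/gpu: 1
        #vpc.amazonaws.com/efa: 1
        #hugepages-2Mi: 5120Mi
```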
## Driver versions and compatibility

[Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) relies on [AWS OFI NCCL](https://github.com/aws/aws-ofi-nccl), which allows [libfabric](https://github.com/ofiwg/libfabric) to be used as a network provider when running [NVIDIA NCCL](https://github.com/NVIDIA/nccl) applications. Compatibility information for AWS OFI NCCL with libfabric and NCCL versions is available in the [AWS OFI NCCL Release Notes](https://github.com/aws/aws-ofi-nccl/releases). Compatibility information for NCCL with CUDA versions is available in the [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-14-3.html#rel_2-14-3). Frameworks such as PyTorch and TensorFlow also need to be compatible with CUDA and the NVIDIA driver, as detailed in the [Framework Containers Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Versions of drivers and frameworks used in containers should be compatible with those installed natively on the host machines where the containers run.

A sample project for building base Amazon Machine Images (AMI) and base Docker container images enabled with EFA is available in the [aws-efa-nccl-baseami-pipeline](https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline) project on GitHub. A [do-framework](https://bit.ly/do-framework) project for building a demo container image with EFA is included in subfolder [cuda-efa-nccl-tests](./cuda-efa-nccl-tests). Execute `./build.sh` to create a compatible container image, then `./push.sh` to upload the image to your private Elastic Container Registry on AWS.

## References

* Karpenter Documentation: [https://karpenter.sh/v0.10.1/](https://karpenter.sh/v0.10.1/)
* Provisioner CRD: [https://karpenter.sh/v0.10.0/provisioner/](https://karpenter.sh/v0.10.0/provisioner/)
* Scheduling: [https://karpenter.sh/v0.10.1/tasks/scheduling/](https://karpenter.sh/v0.10.1/tasks/scheduling/)
* Karpenter Best Practices: [https://aws.github.io/aws-eks-best-practices/karpenter/](https://aws.github.io/aws-eks-best-practices/karpenter/)
* Elastic Fabric Adapter: [https://aws.amazon.com/hpc/efa/](https://aws.amazon.com/hpc/efa/)
* EFA - Supported instance types: [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types)