# Create Amazon EKS cluster with GPU-enabled workers and Kubeflow

This document explains how to create an Amazon EKS cluster with GPU-enabled workers. It is based on the official [Kubeflow on AWS](https://www.kubeflow.org/docs/aws/customizing-aws/) documentation; please check that website for more details on Kubeflow on AWS. If you run into any problems during installation, see [Troubleshooting Deployments on Amazon EKS](https://www.kubeflow.org/docs/aws/troubleshooting-aws/).

## Prerequisites

* Install [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl).
* Install and configure the AWS Command Line Interface (AWS CLI):
  * Install the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).
  * Configure the AWS CLI by running the following command: `aws configure`.
  * Enter your Access Keys ([Access Key ID and Secret Access Key](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys)).
  * Enter your preferred AWS Region and default output options.
* Install [eksctl](https://github.com/weaveworks/eksctl) (version 0.1.27 or newer).
* Install [jq](https://stedolan.github.io/jq/download/).
* Install [ksonnet](https://github.com/ksonnet/ksonnet) (`brew install ksonnet/tap/ks` for macOS users).
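Before you continue, it can help to confirm that each tool is installed, on your `PATH`, and that your AWS credentials work. A quick sanity check (output formats vary by tool version):

```shell
kubectl version --client       # kubectl client version
aws --version                  # AWS CLI version
aws sts get-caller-identity    # confirms your AWS credentials are configured
eksctl version                 # should be 0.1.27 or newer
jq --version
ks version                     # ksonnet
```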
## Create cluster and install Kubeflow

1. Subscribe to the GPU-supported AMI: https://aws.amazon.com/marketplace/pp/B07GRHFXGM

1. Run the following commands to download the latest `kfctl.sh`:

   ```
   export KUBEFLOW_SRC=~/tmp/kubeflow-aws
   export KUBEFLOW_TAG=v0.5-branch

   mkdir -p ${KUBEFLOW_SRC} && cd ${KUBEFLOW_SRC}

   curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
   ```

   * `KUBEFLOW_SRC` - full path to your preferred download directory.

1. Run the following commands to set up your environment and initialize the cluster:

   ```
   export KFAPP=kfapp
   export REGION=us-west-2
   export AWS_CLUSTER_NAME=kubeflow-aws

   ${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform aws \
   --awsClusterName ${AWS_CLUSTER_NAME} \
   --awsRegion ${REGION}
   ```

   * `AWS_CLUSTER_NAME` - a unique name for your Amazon EKS cluster.
   * `KFAPP` - use a relative directory name here rather than an absolute path, such as `kfapp`.
   * `REGION` - the AWS Region you want to create your cluster in.

1. Generate and apply platform changes. You can customize your cluster configuration, control plane logging, and private cluster endpoint access before you `apply platform`; see [Customizing Kubeflow on AWS](https://www.kubeflow.org/docs/aws/customizing-aws/) for more information.

   ```shell
   cd ${KFAPP}
   ${KUBEFLOW_SRC}/scripts/kfctl.sh generate platform
   # Customize your Amazon EKS cluster configuration before following the next step
   ```

1. Open `cluster_config.yaml` and update the file so that it looks like the example below. This file can also be copied from the [repo](../samples/kubeflow/cluster_config.yaml).

   ```
   apiVersion: eksctl.io/v1alpha5
   kind: ClusterConfig

   metadata:
     # AWS_CLUSTER_NAME and AWS_REGION will override `name` and `region` here.
     name: kubeflow-aws
     region: us-west-2
     version: '1.12'

   # If your region has multiple availability zones, you can specify 3 of them.
   #availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]

   # NodeGroup holds all configuration attributes that are specific to a nodegroup.
   # You can have several node groups in your cluster.
   nodeGroups:
     #- name: cpu-nodegroup
     #  instanceType: m5.2xlarge
     #  desiredCapacity: 1
     #  minSize: 0
     #  maxSize: 2
     #  volumeSize: 30

     # Example of GPU node group
     - name: Tesla-V100
       instanceType: p3.8xlarge
       availabilityZones: ["us-west-2b"]
       desiredCapacity: 2
       minSize: 0
       maxSize: 2
       volumeSize: 50
       ssh:
         allow: true
         publicKeyPath: '~/.ssh/id_rsa.pub'
   ```

   Then apply the changes:

   ```shell
   # vim ${KUBEFLOW_SRC}/${KFAPP}/aws_config/cluster_config.yaml
   ${KUBEFLOW_SRC}/scripts/kfctl.sh apply platform
   ```

1. Generate and apply the Kubernetes changes:

   ```shell
   ${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
   ```

   __*Important!!!*__ By default, these scripts create an AWS Application Load Balancer for Kubeflow that is open to the public. This is fine for development testing and short-term use, but we do not recommend that you use this configuration for production workloads. To secure your installation, you have two options:

   * Disable ingress before you `apply k8s`. Open `${KUBEFLOW_SRC}/${KFAPP}/env.sh` and edit the `KUBEFLOW_COMPONENTS` environment variable: delete `,\"alb-ingress-controller\",\"istio-ingress\"` and save the file.
   * Follow the [instructions](https://www.kubeflow.org/docs/aws/authentication/) to add authentication before you `apply k8s`.

   Once your customization is done, or if you're fine with a public endpoint for testing, run this command to deploy Kubeflow:

   ```shell
   ${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s
   ```

   It will take a few minutes for all pods to become ready.

1. Get the memory, CPU, and GPU count for each node in the cluster:

   ```
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"
   ```

   The output looks something like this:

   ```
   NAME                                            MEMORY        CPU   GPU
   ip-192-168-101-177.us-west-2.compute.internal   251643680Ki   32    4
   ip-192-168-196-254.us-west-2.compute.internal   251643680Ki   32    4
   ```

   The maximum number of GPUs that may be scheduled to a pod is capped by the number of GPUs available per node. By default, pods are scheduled on CPU.
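   To actually run a container on a GPU, a pod must request the `nvidia.com/gpu` resource explicitly. A minimal sketch, assuming the NVIDIA device plugin shipped with the GPU AMI is running; the pod name and CUDA image tag are illustrative examples, not part of the Kubeflow setup:

   ```shell
   cat <<EOF | kubectl apply -f -
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-smoke-test        # example name
   spec:
     restartPolicy: Never
     containers:
     - name: cuda
       image: nvidia/cuda:9.0-base   # example CUDA base image
       command: ["nvidia-smi"]
       resources:
         limits:
           nvidia.com/gpu: 1     # request one GPU; cannot exceed the GPUs on a single node
   EOF
   ```

   Once the pod completes, `kubectl logs gpu-smoke-test` should show the `nvidia-smi` device table if a GPU was allocated.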
1. Verify Kubeflow:

   ```
   kubectl get pods -n=kubeflow
   NAME                                                      READY   STATUS    RESTARTS   AGE
   ambassador-5cf8cd97d5-68xgv                               1/1     Running   0          19m
   ambassador-5cf8cd97d5-cxp85                               1/1     Running   0          19m
   ambassador-5cf8cd97d5-r57hc                               1/1     Running   0          19m
   argo-ui-7c9c69d464-p7mjh                                  1/1     Running   0          17m
   centraldashboard-6f47d694bd-qd56p                         1/1     Running   0          18m
   jupyter-0                                                 1/1     Running   0          18m
   katib-ui-6bdb7d76cc-z9dv5                                 1/1     Running   0          16m
   metacontroller-0                                          1/1     Running   0          17m
   minio-7bfcc6c7b9-d8xqf                                    1/1     Running   0          17m
   ml-pipeline-6fdd759597-n9zws                              1/1     Running   0          17m
   ml-pipeline-persistenceagent-5669f69cdd-gdzwq             1/1     Running   1          16m
   ml-pipeline-scheduledworkflow-9f6d5d5b6-zfqtd             1/1     Running   0          16m
   ml-pipeline-ui-67f79b964d-jrwx6                           1/1     Running   0          16m
   mysql-6f6b5f7b64-bg2vp                                    1/1     Running   0          17m
   pytorch-operator-6f87db67b7-4bksz                         1/1     Running   0          17m
   spartakus-volunteer-6f5f47f95-cl5sf                       1/1     Running   0          17m
   studyjob-controller-774d45f695-2k6t8                      1/1     Running   0          16m
   tf-job-dashboard-5f986cf99d-nqxm7                         1/1     Running   0          18m
   tf-job-operator-v1beta1-5876c48976-zbmrq                  1/1     Running   0          18m
   vizier-core-fc7969897-rns98                               1/1     Running   1          16m
   vizier-core-rest-6fcd4665d9-bf69g                         1/1     Running   0          16m
   vizier-db-777675b958-kt8p2                                1/1     Running   0          16m
   vizier-suggestion-bayesianoptimization-54db8d594f-5srk6   1/1     Running   0          16m
   vizier-suggestion-grid-6f5d9d647f-tzgqw                   1/1     Running   0          16m
   vizier-suggestion-hyperband-59dd9bb9bc-mldbl              1/1     Running   0          16m
   vizier-suggestion-random-6dd597c997-8qdnr                 1/1     Running   0          16m
   workflow-controller-5c95f95f58-nvg9l                      1/1     Running   0          17m
   ```

   You may not need all the components. If you want to customize the components, please check [Kubeflow Customization](kubeflow-custom.md).

1. Once you are done experimenting, uninstall Kubeflow and delete the cluster:

   ```
   cd ${KUBEFLOW_SRC}/${KFAPP}
   ${KUBEFLOW_SRC}/scripts/kfctl.sh delete all
   ```
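   If you want to confirm that the teardown finished, you can check that the EKS cluster and its CloudFormation stacks are gone. A quick check, assuming the same `REGION` and `AWS_CLUSTER_NAME` values set earlier:

   ```shell
   # The cluster should no longer appear in this list
   eksctl get cluster --region ${REGION}

   # Stacks for the cluster should show DELETE_COMPLETE (or not appear at all)
   aws cloudformation list-stacks --region ${REGION} \
     --query "StackSummaries[?contains(StackName, '${AWS_CLUSTER_NAME}')].[StackName,StackStatus]" \
     --output table
   ```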