{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Kubeflow Fairing Kaniko Cloud Builder on AWS\n", "\n", "## Requirements\n", "\n", " * You must be running Kubeflow 1.0 or newer on EKS\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create AWS secret in kubernetes and grant aws access to your notebook\n", "\n", "> Note: Once IAM for Service Account is merged in 1.0.1, we don't have to use credentials\n", "\n", "1. Please create an AWS secret in current namespace. \n", "\n", "> Note: To get base64 string, try `echo -n $AWS_ACCESS_KEY_ID | base64`. \n", "> Make sure you have `AmazonEC2ContainerRegistryFullAccess` and `AmazonS3FullAccess` for this experiment. Pods will use credentials to talk to AWS services." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# Replace placeholder with your own AWS credentials\n", "AWS_ACCESS_KEY_ID=''\n", "AWS_SECRET_ACCESS_KEY=''\n", "\n", "kubectl create secret generic aws-secret --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Attach `AmazonEC2ContainerRegistryFullAccess` and `AmazonS3FullAccess` to EKS node group role and grant AWS access to notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Verify you have access to AWS services\n", "\n", "* The cell below checks that this notebook was spawned with credentials to access AWS S3 and ECR" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import os\n", "import uuid\n", "from importlib import reload\n", "import boto3\n", "\n", "# Set REGION for s3 bucket and elastic contaienr registry\n", "AWS_REGION='us-west-2'\n", "boto3.client('s3', region_name=AWS_REGION).list_buckets()\n", "boto3.client('ecr', region_name=AWS_REGION).describe_repositories()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Required Libraries\n", "\n", "Import the libraries required to train this model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from src.kaniko import notebook_setup\n", "reload(notebook_setup)\n", "notebook_setup.notebook_setup()\n", "\n", "# Force a reload of kubeflow; since kubeflow is a multi namespace module\n", "# it looks like doing this in notebook_setup may not be sufficient\n", "import kubeflow\n", "reload(kubeflow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure ECR Docker Registry For Kubeflow Fairing\n", "\n", "* In order to build docker images from your notebook we need a docker registry where the images will be stored\n", "* Below you set some variables specifying a [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/)\n", "* Kubeflow Fairing provides a utility function to guess the name of your AWS account" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from kubeflow import fairing \n", "from kubeflow.fairing import utils as fairing_utils\n", "from kubeflow.fairing.preprocessors import base as base_preprocessor\n", "from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n", "import yaml\n", "\n", "# Setting up AWS Elastic Container Registry (ECR) for storing output containers\n", "# You can use any docker container registry istead of ECR\n", "AWS_ACCOUNT_ID=fairing.cloud.aws.guess_account_id()\n", "AWS_ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')\n", "DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)\n", "\n", "namespace = fairing_utils.get_current_k8s_namespace()\n", "\n", "logging.info(f\"Running in aws region {AWS_REGION}, account {AWS_ACCOUNT_ID}\")\n", "logging.info(f\"Running in namespace {namespace}\")\n", "logging.info(f\"Using docker registry {DOCKER_REGISTRY}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Kubeflow fairing to build the docker image\n", "\n", "* You will use kubeflow fairing's kaniko builder to build a docker image that includes all your dependencies\n", " * You use kaniko because you want to be able to run `pip` to install dependencies\n", " * Kaniko gives you the flexibility to build images from Dockerfiles" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from kubeflow.fairing.builders import cluster\n", "\n", "# output_map is a map of extra files to add to the notebook.\n", "# It is a map from source location to the location inside the context.\n", "output_map = {\n", " \"./src/kaniko/Dockerfile.model\": \"Dockerfile\",\n", " \"./src/kaniko/model.py\": \"model.py\"\n", "}\n", "\n", "preprocessor = base_preprocessor.BasePreProcessor(\n", " command=[\"python\"], # The base class will set this.\n", " input_files=[],\n", " path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n", " output_map=output_map)\n", "\n", "preprocessor.preprocess()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new ECR repository to host model image\n", "ecr_repo_name = 'fairing-kaniko'\n", "!aws ecr create-repository --repository-name $ecr_repo_name --region=$AWS_REGION" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a Tensorflow image as the base image\n", "# We use a custom Dockerfile \n", "cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n", " base_image=\"\", # base_image is set in the Dockerfile\n", " 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a TensorFlow image as the base image\n", "# We use a custom Dockerfile\n", "cluster_builder = cluster.cluster.ClusterBuilder(\n", "    registry=DOCKER_REGISTRY,\n", "    base_image=\"\", # base_image is set in the Dockerfile\n", "    preprocessor=preprocessor,\n", "    image_name=\"fairing-kaniko\",\n", "    dockerfile_path=\"Dockerfile\",\n", "    pod_spec_mutators=[fairing.cloud.aws.add_aws_credentials_if_exists, fairing.cloud.aws.add_ecr_config],\n", "    context_source=cluster.s3_context.S3ContextSource(region=AWS_REGION))\n", "cluster_builder.build()\n", "logging.info(f\"Built image {cluster_builder.image_tag}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create an S3 Bucket\n", "\n", "* Create an S3 bucket to store our models and other results.\n", "* Since we are running in Python we use the Python client library (boto3)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from botocore.exceptions import ClientError\n", "\n", "bucket = f\"{AWS_ACCOUNT_ID}-fairing-kaniko\"\n", "\n", "def create_bucket(bucket_name, region=AWS_REGION):\n", "    \"\"\"Create an S3 bucket in the specified region.\n", "\n", "    :param bucket_name: Bucket to create\n", "    :param region: Region to create the bucket in\n", "    :return: True if bucket created, else False\n", "    \"\"\"\n", "\n", "    # Create the bucket; outside us-east-1 the region must be passed as a LocationConstraint\n", "    try:\n", "        s3_client = boto3.client('s3', region_name=region)\n", "        s3_client.create_bucket(Bucket=bucket_name,\n", "                                CreateBucketConfiguration={'LocationConstraint': region})\n", "    except ClientError as e:\n", "        logging.error(e)\n", "        return False\n", "    return True\n", "\n", "create_bucket(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed training using the Fairing Kaniko image\n", "\n", "* We will train the model by using TFJob to run a distributed training job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from src.kaniko.tfjob_spec_provider import tfj_spec\n", "train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n", "train_spec = tfj_spec(train_name=train_name,\n", "                      num_ps=1,\n", "                      num_workers=2,\n", "                      model_dir=f\"s3://{bucket}/mnist\",\n", "                      export_path=f\"s3://{bucket}/mnist/export\",\n", "                      train_steps=200,\n", "                      batch_size=100,\n", "                      learning_rate=.01,\n", "                      image=cluster_builder.image_tag,\n", "                      AWS_REGION=AWS_REGION)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the training job from the Kaniko image\n", "\n", "* You could write the spec to a YAML file and then do `kubectl apply -f {FILE}`\n", "* Since you are running in Jupyter you will use the TFJob client\n", "* You will run the TFJob in a namespace created by a Kubeflow profile\n", " * The namespace will be the same namespace you are running the notebook in\n", " * Creating a profile ensures the namespace is provisioned with service accounts and other resources needed for Kubeflow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf_job_client = tf_job_client_module.TFJobClient()\n", "tf_job_body = yaml.safe_load(train_spec)\n", "tf_job = tf_job_client.create(tf_job_body, namespace=namespace)\n", "\n", "logging.info(f\"Created job {namespace}.{train_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the job\n", "\n", "* Above you used the Python SDK for TFJob to create the job\n", "* You can also use kubectl to get the status of your job\n", "* The job conditions will tell you whether the job is running, succeeded or failed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!kubectl get tfjobs -o yaml {train_name}" ] },
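{ "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for the job and inspect the results (optional)\n", "\n", "* The cell below is a minimal sketch that waits for the TFJob to finish and then lists the model artifacts exported to S3\n", "* It assumes the `TFJobClient` in your installed SDK version exposes `wait_for_job` and `is_job_succeeded`, as used in the Kubeflow MNIST example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A sketch: wait for the TFJob to complete, then check the S3 export path.\n", "# Assumes wait_for_job/is_job_succeeded exist in your kubeflow-tfjob SDK version.\n", "tf_job_client.wait_for_job(train_name, namespace=namespace, watch=True)\n", "logging.info(f\"Job succeeded: {tf_job_client.is_job_succeeded(train_name, namespace=namespace)}\")\n", "\n", "# List the model artifacts the training job exported to S3\n", "s3 = boto3.client('s3', region_name=AWS_REGION)\n", "for obj in s3.list_objects_v2(Bucket=bucket, Prefix=\"mnist/export\").get('Contents', []):\n", "    print(obj['Key'])" ] },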
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!aws ecr delete-repository --repository-name $ecr_repo_name --force --region=$AWS_REGION\n", "!aws s3 rb s3://$bucket --force" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [], "metadata": { "collapsed": false } } } }, "nbformat": 4, "nbformat_minor": 4 }