{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Container Basics - with CI/CD\n", "\n", "This notebook runs on conda_python3 kernel. \n", "\n", "In container-basic notebook, we used the command line to create, build and run a few containers on the same instance where we run the notebook. \n", "\n", "In this notebook, we will use AWS CodeBuild to demonstrate how you can use a CI/CD pipeline to build and run containers on an AWS managed service - AWS Batch.\n", "\n", "**Note**: After you create a service-role, most of the time, the role will take effect immmediately. However, sometimes it takes a few minutes to propagate. If you see an IAM error, wait a few minutes and try again. \n", "\n", "Similarly with the container-basic notebook, we start with the boto3 SDK and create our session and an output bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#You only need to do this once per kernel to analyze the fastq data. If you don't want to run the last step, you can skip this.\n", "!pip install bioinfokit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import os\n", "import project_path # path to helper methods\n", "import importlib\n", "from lib import workshop\n", "from botocore.exceptions import ClientError\n", "from IPython.display import display, clear_output\n", "\n", "# create a bucket for the workshop to store output files. \n", "session = boto3.session.Session()\n", "\n", "region_name = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "proj_name = 'mysra-cicd-pipeline'\n", "image_tag = 'mySRATools'\n", "\n", "# we will use this bucket for some artifacts and the output of sratools. \n", "bucket = workshop.create_bucket(region_name, session, f\"container-ws-{account_id}\", False)\n", "print(bucket)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same helper magic for Jupyter to create files more easily\n", "from IPython.core.magic import register_line_cell_magic\n", "\n", "@register_line_cell_magic\n", "def writetemplate(line, cell):\n", " with open(line, 'w+') as f:\n", " f.write(cell.format(**globals()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automate the build process of SRA-Tools container\n", "\n", "We will re-use the customized container from \"Container Basic\" notebook and continue the genomics use case with the NCBI SRA (Sequence Read Archive) SRA Tool (https://github.com/ncbi/sra-tools) and fasterq-dump (https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) to extract fastq from SRA-accessions.\n", "\n", "The command takes a package name as an argument:\n", "```\n", "$ fasterq-dump SRR000001\n", "```\n", "\n", "We will use the base Ubuntu image and install sra-tools (https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.0/sratoolkit.2.10.0-ubuntu64.tar.gz) \n", "\n", "The workflow of the program in the contianer: \n", "1. Upon start, container runs a script \"sratest.sh\".\n", "3. sratest.sh will \"prefetch\" the data package, whose name is passed via an environment variable. \n", "4. sratest.sh then run \"fasterq-dump\" on the data package\n", "5. 
sratest.sh will then upload the result to s3://{bucket}\n", "\n", "The output of fasterq-dump will be stored in s3://{bucket}/data/sra-toolkit/fasterq/{PACKAGE_NAME}\n", "\n", "## AWS CodePipeline\n", "\n", "In the \"Container Basics\" notebook we built and ran the container on the same instance that runs the notebook. Here we will use AWS CodePipeline to automate a CI/CD process for our tools.\n", "\n", "The AWS CodePipeline consists of the following components:\n", "\n", "1. CodeCommit - code repository, which will contain a \"buildspec.yml\", \"Dockerfile\", and all files needed for the container.\n", "2. CodeBuild - this will spawn an instance to run the \"docker build\" and \"push\" the image to Amazon ECR.\n", "3. CodeDeploy - we will not use CodeDeploy in this notebook.\n", "\n", "Each time we check in code to CodeCommit, it will trigger the entire CodePipeline. When the pipeline finishes, we will have a new version of the Docker image in the container registry (Amazon ECR). This allows downstream recipients of our image to take advantage of our work without having to understand the internal details of our container." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PACKAGE_NAME='SRR000002'\n", "\n", "# this is where the output will be stored\n", "sra_prefix = 'data/sra-toolkit/fasterq'\n", "sra_output = f\"s3://{bucket}/{sra_prefix}\"\n", "\n", "print(sra_output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1. Create the run script\n", "This script is executed as the ENTRYPOINT of our container. \n", "1. It will fetch the SRA package by package name\n", "2. Run fasterq-dump on the package data \n", "3. Copy the output to S3\n", "\n", "Note: for teaching purposes, we will be using a different syntax (\"\"\") in this notebook to prepare our files. See Step 5, where we commit these files to CodeCommit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sratest_content = \"\"\"#!/bin/bash\n", "set -x\n", "\n", "prefetch $PACKAGE_NAME --output-directory /tmp\n", "fasterq-dump $PACKAGE_NAME -e 8\n", "aws s3 sync . $SRA_OUTPUT/$PACKAGE_NAME\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2. Create our own Docker image file\n", "\n", "Let's build our own image using an Ubuntu base image. \n", "\n", "1. Install tzdata - as before, we will install this dependency explicitly (with the -y argument) to avoid the timezone prompt that would halt the docker build process\n", "2. Install wget and awscli.\n", "3. Download the sratoolkit ubuntu64 binary and unpack it into /opt\n", "4. Set the PATH to include the latest sratoolkit/bin\n", "5. USER nobody is needed to set the permissions for the sratoolkit configuration. \n", "6. Use the same sratest.sh script \n", "\n", "Note: As of Nov 2020, Docker Hub has set request limits on its public repos and you might get throttled if you use a Docker Hub base image. Therefore, in this example, we will use the base Ubuntu image from the AWS public container registry (Amazon ECR Public)."
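, "\n", "Before building, you can optionally check which release \"current\" resolves to. This is a minimal sketch (run from this notebook, not inside the container) that simply reads the version file referenced by the Dockerfile below:\n", "\n", "```python\n", "# optional sanity check (sketch): print the sratoolkit version that the\n", "# Dockerfile's symlink step will resolve \"current\" to\n", "import urllib.request\n", "\n", "version_url = \"https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current.version\"\n", "with urllib.request.urlopen(version_url) as resp:\n", "    print(\"sratoolkit current version:\", resp.read().decode().strip())\n", "```\n"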
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dockerfile_content=\"\"\"FROM public.ecr.aws/ubuntu/ubuntu:latest\n", "\n", "RUN apt-get update \n", "RUN DEBIAN_FRONTEND=\"noninteractive\" apt-get -y install tzdata \n", "RUN apt-get install -y curl wget libxml-libxml-perl awscli uuid-runtime\n", " \n", "RUN wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz -O /tmp/sratoolkit.tar.gz \\\n", " && tar zxf /tmp/sratoolkit.tar.gz -C /opt/ && rm /tmp/sratoolkit.tar.gz && \\\n", " ln -s /opt/sratoolkit.$(curl -s \"https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current.version\")-ubuntu64 /opt/sratoolkit\n", " \n", "ENV PATH=\"/opt/sratoolkit/bin/:${PATH}\"\n", "ADD sratest.sh /usr/local/bin/sratest.sh\n", "RUN chmod +x /usr/local/bin/sratest.sh\n", "RUN mkdir /tmp/.ncbi && printf '/LIBS/GUID = \"%s\"\\\\n' `uuidgen` > /tmp/.ncbi/user-settings.mkfg\n", "\n", "ADD filelist.txt /tmp/filelist.txt\n", "ENV HOME=/tmp\n", "WORKDIR /tmp\n", "USER nobody\n", "ENTRYPOINT [\"/usr/local/bin/sratest.sh\"]\n", "\"\"\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3. Create the build spec file\n", "We will be using the AWS CodeBuild to create our docker image and store it into AWS ECR (private docker image registry). Thus, we need to create a build specifications for this service, such that it knows what commands to run to setup our build environment (pre_build), do the actual build (build section), and then push the image to AWS ECR (post_build). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "buildspec_content =\"\"\"version: 0.2\n", "\n", "phases:\n", " pre_build:\n", " commands:\n", " - echo Logging in to Amazon ECR...\n", " - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com\n", " build:\n", " commands:\n", " - echo Build started on `date`\n", " - echo Building the Docker image... \n", " - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG -f Dockerfile.cicd . \n", " - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG \n", " post_build:\n", " commands:\n", " - echo Build completed on `date`\n", " - echo Pushing the Docker image...\n", " - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG\n", "\"\"\"\n", "\n", "#place holder for later use , add in container so we don't have to change the docker file\n", "file_list_content = \"\"\"\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4. Create an ECR repo\n", "Before we can actually build our image, we need to have the repository referenced in our (post_build) phase. We will use boto3 again to interact with the AWS ECR APIs. We will actually use the repository in Step 6, after the container image is built. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ecr_client = boto3.client('ecr')\n", "try:\n", " resp = ecr_client.create_repository(repositoryName=proj_name)\n", "except ClientError as e:\n", " if e.response['Error']['Code'] == 'RepositoryAlreadyExistsException':\n", " print(f\"ECR Repo {proj_name} already exists, skip\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5. 
Create an AWS CodeCommit repo and check in the files\n", "\n", "We start by setting up the proper access permissions using the IAM service. Each service (CodePipeline, CodeBuild) needs its own policies. We also need to allow these services to access other related services on our behalf (S3 and ECR).\n", "\n", "Note the CodePipeline and CodeBuild role ARNs; we will use the CodeBuild role in Step 6 and the CodePipeline role in Step 7." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# roleArn:\n", "iam_client = session.client('iam')\n", "\n", "codepipeline_service_role_name = f\"{proj_name}-codepipeline-service-role\"\n", "codepipeline_policies = ['arn:aws:iam::aws:policy/AWSCodePipeline_FullAccess', \n", " 'arn:aws:iam::aws:policy/AWSCodeCommitFullAccess',\n", " 'arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", " 'arn:aws:iam::aws:policy/AWSCodeBuildAdminAccess'\n", " ]\n", "codepipeline_role_arn = workshop.create_service_role_with_policies(codepipeline_service_role_name, 'codepipeline.amazonaws.com', codepipeline_policies )\n", "print(codepipeline_role_arn)\n", " \n", "codebuild_service_role_name = f\"{proj_name}-codebuild-service-role\"\n", "codebuild_policies = ['arn:aws:iam::aws:policy/AWSCodeBuildAdminAccess',\n", " 'arn:aws:iam::aws:policy/CloudWatchFullAccess',\n", " 'arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", " 'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess']\n", "codebuild_role_arn = workshop.create_service_role_with_policies(codebuild_service_role_name, 'codebuild.amazonaws.com', codebuild_policies )\n", "print(codebuild_role_arn)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's prepare our files for the initial check-in. We have the Dockerfile, sratest script, buildspec and filelist (empty for now)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# prepare the files for the check-in\n", "put_files=[{\n", " 'filePath': 'Dockerfile.cicd',\n", " 'fileContent': dockerfile_content\n", " },\n", " {\n", " 'filePath': 'sratest.sh',\n", " 'fileContent': sratest_content\n", " },\n", " {\n", " 'filePath': 'filelist.txt',\n", " 'fileContent': file_list_content\n", " },\n", " {\n", " 'filePath': 'buildspec.yml',\n", " 'fileContent': buildspec_content\n", " }]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are now ready for the first commit. We will create our code repository and upload our files into the \"main\" branch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " \n", "codecommit_client = boto3.client('codecommit')\n", "try:\n", " resp = codecommit_client.create_repository(repositoryName=proj_name)\n", "except ClientError as e:\n", " if e.response['Error']['Code'] == 'RepositoryNameExistsException':\n", " print(f\"Repo {proj_name} exists, use that one\")\n", "\n", "try:\n", " resp = codecommit_client.get_branch(repositoryName=proj_name, branchName='main')\n", " parent_commit_id = resp['branch']['commitId']\n", "except ClientError as e:\n", " if e.response['Error']['Code'] == 'BranchDoesNotExistException':\n", " # the main branch does not exist yet; create it with the initial commit \n", " workshop.commit_files(proj_name, \"main\", put_files, None)\n", "else:\n", " try:\n", " resp = workshop.commit_files(proj_name, \"main\",put_files, parent_commit_id)\n", " except ClientError as ee:\n", " if ee.response['Error']['Code'] == 'NoChangeException':\n", " print('No change detected. skip commit')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6. 
Create a CodeBuild project\n", "\n", "The second stage of the CI/CD pipeline is the build process. We use an instance managed by AWS (see computeType below) to build the container using a standard Amazon Linux 2 build environment. The CodeBuild process is triggered by the CodeCommit code check-ins. \n", "\n", "Note: **codebuild-service-role takes a little longer to propagate**. If you see a permission error, please retry in a minute.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "codebuild_client = boto3.client('codebuild')\n", "codebuild_name = f\"Build-{proj_name}\" \n", "codecommit_name = f\"Source-{proj_name}\"\n", "try: \n", " resp = codebuild_client.create_project(name=codebuild_name, \n", " description=\"CICD workshop build demo\",\n", " source= {\n", " 'type': \"CODEPIPELINE\"\n", " },\n", " artifacts= {\n", " \"type\": \"CODEPIPELINE\",\n", " \"name\": proj_name\n", " },\n", " environment= {\n", " \"type\": \"LINUX_CONTAINER\",\n", " \"image\": \"aws/codebuild/amazonlinux2-x86_64-standard:3.0\",\n", " \"computeType\": \"BUILD_GENERAL1_SMALL\",\n", " \"environmentVariables\": [\n", " {\n", " \"name\": \"AWS_DEFAULT_REGION\",\n", " \"value\": region_name,\n", " \"type\": \"PLAINTEXT\"\n", " },\n", " {\n", " \"name\": \"AWS_ACCOUNT_ID\",\n", " \"value\": account_id,\n", " \"type\": \"PLAINTEXT\"\n", " },\n", " {\n", " \"name\": \"IMAGE_REPO_NAME\",\n", " \"value\": proj_name,\n", " \"type\": \"PLAINTEXT\"\n", " },\n", " {\n", " \"name\": \"IMAGE_TAG\",\n", " \"value\": image_tag,\n", " \"type\": \"PLAINTEXT\"\n", " }\n", " ],\n", " \"privilegedMode\": True,\n", " \"imagePullCredentialsType\": \"CODEBUILD\" \n", " },\n", " logsConfig= {\n", " \"cloudWatchLogs\": {\n", " \"status\": \"ENABLED\",\n", " \"groupName\": proj_name\n", " },\n", " \"s3Logs\": {\n", " \"status\": \"DISABLED\"\n", " }\n", " },\n", " serviceRole= codebuild_role_arn\n", " )\n", "except ClientError as e:\n", " if e.response['Error']['Code'] == 'ResourceAlreadyExistsException':\n", " print(f\"CodeBuild project {proj_name} exists, skip...\")\n", " else:\n", " raise e\n", "\n", "\n", "print(f\"CodeBuild project name {codebuild_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7. Build the AWS CodePipeline \n", "\n", "We now combine Steps 5 and 6 into a pipeline with two stages: source (commit) and build."
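, "\n", "After the next cell creates the pipeline, every check-in to CodeCommit will trigger it automatically. If you ever want to re-run the pipeline by hand (for example, after fixing an IAM permission) or watch its progress without opening the console, here is a minimal boto3 sketch (it assumes the pipeline has already been created with the name in proj_name):\n", "\n", "```python\n", "# sketch: trigger the pipeline manually and print the status of each stage\n", "codepipeline_client = boto3.client('codepipeline')\n", "codepipeline_client.start_pipeline_execution(name=proj_name)\n", "\n", "state = codepipeline_client.get_pipeline_state(name=proj_name)\n", "for stage in state['stageStates']:\n", "    print(stage['stageName'], stage.get('latestExecution', {}).get('status'))\n", "```\n"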
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "codepipeline_client = boto3.client('codepipeline')\n", "\n", "stage1 = {\n", " \"name\":f\"{codecommit_name}\",\n", " \"actions\": [\n", " {\n", " \"name\": \"Source\",\n", " \"actionTypeId\": {\n", " \"category\": \"Source\",\n", " \"owner\": \"AWS\",\n", " \"provider\": \"CodeCommit\",\n", " \"version\": \"1\"\n", " },\n", " \"runOrder\": 1,\n", " \"configuration\": {\n", " \"BranchName\": \"main\",\n", " \"OutputArtifactFormat\": \"CODE_ZIP\",\n", " \"PollForSourceChanges\": \"true\",\n", " \"RepositoryName\": proj_name\n", " },\n", " \"outputArtifacts\": [\n", " {\n", " \"name\": \"SourceArtifact\"\n", " }\n", " ],\n", " \"inputArtifacts\": [],\n", " \"region\": region_name,\n", " \"namespace\": \"SourceVariables\"\n", " }\n", " ]\n", "}\n", "\n", "stage2 = {\n", " \"name\": f\"{codebuild_name}\",\n", " \"actions\": [\n", " {\n", " \"name\": \"Build\",\n", " \"actionTypeId\": {\n", " \"category\": \"Build\",\n", " \"owner\": \"AWS\",\n", " \"provider\": \"CodeBuild\",\n", " \"version\": \"1\"\n", " },\n", " \"runOrder\": 1,\n", " \"configuration\": {\n", " \"ProjectName\": codebuild_name\n", " },\n", " \"outputArtifacts\": [\n", " {\n", " \"name\": \"BuildArtifact\"\n", " }\n", " ],\n", " \"inputArtifacts\": [\n", " {\n", " \"name\": \"SourceArtifact\"\n", " }\n", " ],\n", " \"region\": region_name,\n", " \"namespace\": \"BuildVariables\"\n", " }\n", " ] \n", "}\n", "\n", "\n", "stages = [ stage1, stage2]\n", "\n", "\n", "pipeline = {\n", " 'name': proj_name,\n", " 'roleArn': codepipeline_role_arn,\n", " 'artifactStore': {\n", " 'type': 'S3',\n", " 'location': bucket\n", " }, \n", " 'stages': stages\n", "}\n", "\n", "try:\n", " resp = codepipeline_client.create_pipeline( pipeline= pipeline)\n", " print(\"Created pipeline\",resp)\n", "except ClientError as e:\n", " if e.response['Error']['Code'] == 'PipelineNameInUseException':\n", " print(f\"Codepipeline {proj_name} already exists \" )\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 8. Check the container image in the repo\n", "\n", "Navigate to CodePipeline in the AWS Console to check the status of the step above. The initial CodePipline process will take a few minutes. It will pull assets from CodeCommit, build the docker image on a managed instance, and push the result image into ECR. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We should see a container image with the image tag \"mySRATools\" - this is defined as an environment variable in CodeBuild\n", "#resp = ecr_client.list_images(repositoryName=proj_name)\n", "while True:\n", " resp = ecr_client.describe_images(repositoryName=proj_name)\n", " if resp['imageDetails']:\n", " for image in resp['imageDetails']:\n", " print(\"image pushed at: \" + str(image['imagePushedAt']))\n", " break\n", " else:\n", " clear_output(wait=True)\n", " display(\"Build not done yet, please wait and retry this step. 
Please do not proceed until you see the 'image pushed' message\")\n", " time.sleep(20)\n", "# this is used later in job_definition for AWS Batch\n", "image_uri= f\"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{proj_name}:{image_tag}\"\n", "print(image_uri)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Run the container on AWS\n", "\n", "Now that we have our container image built by the CodePipeline and registered in the container registry (ECR), we will next run the container on AWS managed services, including AWS Batch and Amazon EKS.\n", "\n", "AWS Batch enables you to run large-scale batch computing jobs without the need to install and manage software or clusters. \n", "\n", "Amazon Elastic Kubernetes Service (EKS) makes it easy for you to run Kubernetes applications. It provides highly available and secure clusters and automates key tasks such as patching, node provisioning, and updates. \n", "\n", "In the following two sections, we will run our sratool container as jobs on both Amazon EKS and AWS Batch. \n", "\n", "## Option A. Run the container in Amazon EKS\n", "\n", "This option takes about 20 minutes to complete. If you don't have enough time in a workshop, you can skip to Option B. \n", "\n", "We will use the **eksctl** CLI and the Kubernetes tool **kubectl** to create an EKS cluster and interact with it.\n", "\n", "### Step 1. Install the eksctl, kubectl and AWS CLIs (command-line tools) \n", "EKS supports multiple versions of Kubernetes. **Uncomment the respective line below if you want to test a specific version.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl --silent --location \"https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz\" | tar xz -C /tmp\n", "!sudo mv -v /tmp/eksctl /usr/local/bin\n", "!eksctl version\n", "\n", "## Kubernetes 1.19\n", "#!sudo curl --silent --location -o /usr/local/bin/kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.19.6/2021-01-05/bin/linux/amd64/kubectl\n", "\n", "## Kubernetes 1.20\n", "#!sudo curl --silent --location -o /usr/local/bin/kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.20.4/2021-04-12/bin/linux/amd64/kubectl\n", "\n", "## Kubernetes 1.21\n", "!sudo curl --silent --location -o /usr/local/bin/kubectl https://amazon-eks.s3-us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/kubectl\n", " \n", " \n", "!sudo chmod +x /usr/local/bin/kubectl\n", "\n", "# you can ignore dependency errors when running the following command\n", "!pip install --upgrade awscli && hash -r\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2. Create the EKS cluster configuration file\n", "We will use very simple default settings for the cluster and add a node group with 2 instances. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kubectl_ver= !kubectl version --output json 2>/dev/null | jq -r '.clientVersion.major + \".\" + .clientVersion.minor' | sed 's/\\+.*$//'\n", "kubectl_ver = kubectl_ver[0]\n", "print(kubectl_ver)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use two availability zones (a and b) in the current region."
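, "\n", "As a quick optional check (a sketch using the boto3 session created earlier), you can confirm that the current region actually exposes the \"a\" and \"b\" availability zones that the cluster config below assumes:\n", "\n", "```python\n", "# sketch: list the availability zones in this region and verify 'a' and 'b' exist\n", "ec2 = session.client('ec2')\n", "azs = [z['ZoneName'] for z in ec2.describe_availability_zones()['AvailabilityZones']]\n", "print(azs)\n", "assert f\"{region_name}a\" in azs and f\"{region_name}b\" in azs\n", "```\n"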
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writetemplate eksworkshop.yaml\n", "apiVersion: eksctl.io/v1alpha5\n", "kind: ClusterConfig\n", "\n", "metadata:\n", " name: eksworkshop-eksctl\n", " region: {region_name}\n", " version: \"{kubectl_ver}\"\n", "\n", "availabilityZones: [\"{region_name}a\", \"{region_name}b\"]\n", "\n", "managedNodeGroups:\n", "- name: nodegroup\n", " desiredCapacity: 2\n", " instanceType: t3.small" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3. Create the EKS cluster\n", "\n", "*Note*: this step can take up to 15 minutes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!eksctl create cluster -f eksworkshop.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4. Add S3 write permission to the node execution role. \n", "\n", "With the default setting, eks nodes will have an execution role with the following permissions: \n", "1. AmazonEKSWorkerNodePolicy - AWS managed policy\n", "1. AmazonEC2ContainerRegistryReadOnly- AWS managed policy\n", "1. AmazonEKS_CNI_Policy - AWS managed policy\n", "\n", "Our container needs permission to write to ${sra_output bucket}, so we need to add S3 write permission to the bucket. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the node status\n", "!kubectl get nodes # if we see our 2 nodes, we know we have authenticated correctly\n", "\n", "# EKS clusters are deployed using CloudFormation stacks in the backend. \n", "# get the node group stack name and use that to get the nodegroup instance role name\n", "STACK_NAME = !eksctl get nodegroup --cluster eksworkshop-eksctl -o json | jq -r '.[].StackName'\n", "STACK_NAME=STACK_NAME[0]\n", "\n", "ROLE_NAME= !aws cloudformation describe-stack-resources --stack-name $STACK_NAME | jq -r '.StackResources[] | select(.ResourceType==\"AWS::IAM::Role\") | .PhysicalResourceId'\n", "ROLE_NAME=ROLE_NAME[0]\n", "\n", "resp = iam_client.put_role_policy(RoleName=ROLE_NAME, \n", " PolicyName='S3AccessPolicy', \n", " PolicyDocument='{\"Version\": \"2012-10-17\",\"Statement\": [{\"Sid\": \"wsbucket\",\"Effect\": \"Allow\",\"Action\": [\"s3:PutObject\",\"s3:GetObject\",\"s3:ListBucket\"],\"Resource\": [\"arn:aws:s3:::'+bucket+'/*\",\"arn:aws:s3:::'+bucket+'\"]}]}')\n", " \n", "print(resp)\n", "\n", "# Add s3:PutObject to $sra_output bucket so container can upload result to that bucket\n", "sra_eks_output = f\"\"\"s3://{bucket}/data/sra-toolkit-eks/fasterq\"\"\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5. Create job and deploy it to the eks cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are creating a new Kubernetes job launching the sratool application using the container image that we placed into the ECR repository on Step 8. We will use PACKAGE_NAME and SRA_OUTPUT as parameters for invoking our container. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writetemplate sratool-deploy.yaml\n", "apiVersion: batch/v1\n", "kind: Job\n", "metadata:\n", " name: my-pod\n", " namespace: default\n", " labels:\n", " app: my-sratool\n", "spec:\n", " template:\n", " metadata:\n", " labels:\n", " app: my-sratool\n", " spec:\n", " containers:\n", " - name: sratool\n", " image: {image_uri}\n", " env:\n", " - name: PACKAGE_NAME\n", " value: {PACKAGE_NAME}\n", " - name: SRA_OUTPUT\n", " value: {sra_eks_output}\n", " restartPolicy: Never" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use instance metadata and STS to capture the current region and account id for the ECR login process. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//') && \\\n", " ACCOUNT=$(aws sts get-caller-identity|jq -r '.Account') && \\\n", " aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com\n", "\n", "#!kubectl delete -f sratool-deploy.yaml \n", "\n", "!kubectl apply -f sratool-deploy.yaml " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6. Monitor the job execution.\n", "\n", "In the previous step, we ran the container as a job on the eks cluster. Kubectl provides tools for you to monitor the job status. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!kubectl describe jobs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## you can repeat the run of this cell, till you see the logs of results uploaded to S3. \n", "\n", "!kubectl logs job/my-pod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option B. Run the container in AWS Batch\n", "\n", "\n", "If you want a deeper dive on AWS Batch, please refer to notebooks/hpc/batch-fastqc.ipynb, in this workshop repo. \n", "\n", "In this notebook, we will create an AWS Batch environemnt and run the container job using the image we created. \n", "\n", "### Step 1. Create a compute environment\n", "To create a Compute Environment, you need the following\n", "1. EC2 instance role\n", "2. EC2 instance profile\n", "3. Batch service role\n", "4. VPC and subnets \n", "5. Security group\n", "6. Compute resource definition\n", "\n", "**Note**: This step will take up to 10 minutes. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "try: \n", " ce_response = workshop.create_simple_compute_environment(proj_name)\n", " ce_name = ce_response['computeEnvironmentName']\n", "except ClientError as ee:\n", " if ee.response['Error']['Message'] == 'Object already exists':\n", " print('Compute environemnt already exists. skip')\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2. Create a job queue\n", "To create a job queue, you need the compute environment\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "queue_name, queue_arn = workshop.create_job_queue(ce_name, 1)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3. 
Create the job definition\n", "We need the task service role and its corresponding policy (e.g., AmazonS3FullAccess) here.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "importlib.reload(workshop)\n", "\n", "batch_task_role_name = f\"batch_task_role_{proj_name}\"\n", "batch_task_policies = [\"arn:aws:iam::aws:policy/AmazonS3FullAccess\"]\n", "taskRole = workshop.create_service_role_with_policies(batch_task_role_name, \"ecs-tasks.amazonaws.com\", batch_task_policies)\n", "print(taskRole)\n", "\n", "\n", "jd = workshop.create_job_definition(proj_name, image_uri, taskRole )\n", "jd_name = jd['jobDefinitionName']\n", "print(jd_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4. We are now ready to submit the job\n", "\n", "We will replicate the results from the container-basic notebook's local run with PACKAGE_NAME='SRR000004', but this time using AWS Batch and submitting a job to another machine in our compute environment. This approach allows us to scale our computations well beyond the capabilities of our local environment/machine." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PACKAGE_NAME='SRR000004'\n", "\n", "batch_client = boto3.client('batch')\n", "response = batch_client.submit_job(\n", " jobName=f\"job-{proj_name}\",\n", " jobQueue=queue_name,\n", " jobDefinition=jd_name,\n", " containerOverrides={\n", " 'environment': [\n", " {\n", " 'name': 'PACKAGE_NAME',\n", " 'value': PACKAGE_NAME\n", " },\n", " {\n", " 'name': 'SRA_OUTPUT',\n", " 'value': sra_output\n", " } \n", " ]\n", " })\n", "\n", "job_id = response['jobId']\n", "print(f\"Job submitted: {job_id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Check the output\n", "\n", "As the container (and correspondingly its internal SRA code) is now running on a remote machine, we cannot use the Docker command line tools to inspect its status. Instead, we can use the AWS Batch console and check the job queue for successful completion. \n", "\n", "Alternatively, we can run the code in the cell below. Note: if you do not see the SUCCEEDED status, please retry after a few seconds.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#jobs = batch_client.list_jobs(jobQueue=queue_name)\n", "jobs = batch_client.describe_jobs(jobs=[job_id])\n", "\n", "for j in jobs['jobs']:\n", " print(f\"JobName: {j['jobName']} Status: {j['status']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait... until the previous step returns a SUCCEEDED status\n", "The processing might take a short or a long time to complete, depending on what we are doing inside the container. AWS Batch takes care of allocating the necessary computing resources, sending our container to the respective machine(s), kickstarting the execution, monitoring for completion and keeping us informed via the respective job queue statuses. Under the hood, AWS Batch uses Amazon Elastic Container Service (ECS) to automate these tasks. \n", "\n", "### Success?\n", "\n", "Let's check out our bucket for results."
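, "\n", "If you prefer to block in the notebook until the job finishes instead of re-running the status cell above, here is a minimal polling sketch that reuses the same describe_jobs call:\n", "\n", "```python\n", "# sketch: poll until the Batch job reaches a terminal state\n", "import time\n", "\n", "while True:\n", "    status = batch_client.describe_jobs(jobs=[job_id])['jobs'][0]['status']\n", "    print(status)\n", "    if status in ('SUCCEEDED', 'FAILED'):\n", "        break\n", "    time.sleep(15)\n", "```\n"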
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# checkout the outfiles on S3\n", "s3_client = session.client('s3')\n", "objs = s3_client.list_objects(Bucket=bucket, Prefix=sra_prefix)\n", "for obj in objs['Contents']:\n", " fn = obj['Key']\n", " p = os.path.dirname(fn)\n", " if not os.path.exists(p):\n", " os.makedirs(p)\n", " s3_client.download_file(bucket, fn , fn)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from bioinfokit.analys import fastq\n", "fastq_iter = fastq.fastq_reader(file=f\"{sra_prefix}/{PACKAGE_NAME}/{PACKAGE_NAME}.fastq\") \n", "# read fastq file and print out the first 10, \n", "i = 0\n", "for record in fastq_iter:\n", " # get sequence headers, sequence, and quality values\n", " header_1, sequence, header_2, qual = record\n", " # get sequence length\n", " sequence_len = len(sequence)\n", " # count A bases\n", " a_base = sequence.count('A')\n", " if i < 10:\n", " print(sequence, qual, a_base, sequence_len)\n", " i +=1\n", "\n", "print(f\"Total number of records for package {PACKAGE_NAME} : {i}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's process multiple files at the same time \n", "\n", "Now we learned how to run a single job in batch. Now we can try to run multiple jobs with the same container with different input parameters. \n", "\n", "You can also use AWS Batch Array Jobs (https://docs.aws.amazon.com/batch/latest/userguide/array_index_example.html) to process multiple jobs at the same time. $AWS_BATCH_JOB_ARRAY_INDEX will be passed to each instance of the container. You can use that to identify input or different paramenters for your job. We will use the array index to identify the package_name. \n", "\n", "Since our previous container was designed to run a single job with PACKAGE_NAME passed down as environment variable, we will modify the Dockerfile a little to take an array of package names. We will take this opportunity to show you how the container CI/CD pipeline can help us with the automation. Once you submit the file changes to CodeCommit, it will trigger the code pipeline to kick off the build and the image will be updated. \n", "\n", "This approach uses a file that contains the list of all files that needs to be processed. This requires the rebuild of the container image. You can also pass the list of all the package names in a list as an environment variable or parameter to the container. But that only works if the list is relatively small. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "package_list=['SRR000011', 'SRR000012', 'SRR000013', 'SRR000014', 'SRR000016']\n", "\n", "file_list_content = ''\n", "for p in package_list:\n", " file_list_content +=p+'\\n'\n", " \n", "sratest_content = \"\"\"#!/bin/bash\n", "\n", "set -x\n", "LINE=$((AWS_BATCH_JOB_ARRAY_INDEX+1))\n", "FILE_NAME=$(sed -n ${LINE}p /tmp/filelist.txt)\n", "prefetch $FILE_NAME --output-directory /tmp\n", "fasterq-dump $FILE_NAME -e 8\n", "aws s3 sync . $SRA_OUTPUT/$FILE_NAME\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now push these changes into our CodeCommit repository." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "put_files=[{\n", " 'filePath': 'filelist.txt',\n", " 'fileContent': file_list_content\n", " },\n", " {\n", " 'filePath': 'sratest.sh',\n", " 'fileContent': sratest_content\n", " }]\n", "\n", "resp = codecommit_client.get_branch(repositoryName=proj_name, branchName='main')\n", "parent_commit_id = resp['branch']['commitId']\n", "try:\n", " resp = workshop.commit_files(proj_name, \"main\",put_files, parent_commit_id)\n", "except ClientError as ee:\n", " if ee.response['Error']['Code'] == 'NoChangeException':\n", " print('No change detected. skip commit')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** After changing the two files, the CodePipeline will kickoff, please go to the CodePipeline console to check the build status. You need to **wait till the build completes** to submit the array job below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = batch_client.submit_job(\n", " jobName=f\"job-{proj_name}\",\n", " jobQueue=queue_name,\n", " arrayProperties={\n", " 'size': len(package_list)\n", " },\n", " jobDefinition=jd_name,\n", " containerOverrides={\n", " 'environment': [\n", " {\n", " 'name': 'SRA_OUTPUT',\n", " 'value': sra_output\n", " } \n", " ]\n", " })\n", "\n", "job_id = response['jobId']\n", "print(f\"Job submitted: {job_id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check our progress.... Once you see JobName: job-mysra-cicd-pipeline Status: SUCCEEDED, you can got to S3://container-ws-xxxxx/data/sra-toolkit/fasterq and see SRR000011,12,13,14,16 data folders created as the result of the array job on AWS Batch.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "jobs = batch_client.describe_jobs(jobs=[job_id])\n", "\n", "for j in jobs['jobs']:\n", " print(f\"JobName: {j['jobName']} Status: {j['status']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Don't forget to clean up " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cleanup Batch jobs\n", "workshop.delete_job_queue(queue_name)\n", "workshop.delete_job_definition(jd_name)\n", "workshop.delete_simple_compute_environment(proj_name)\n", "workshop.delete_service_role_with_policies(batch_task_role_name, batch_task_policies)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# delete the CI/CD pipeline and repository\n", "codepipeline_client.delete_pipeline(name=proj_name)\n", "codebuild_client.delete_project(name=codebuild_name)\n", "codecommit_client.delete_repository(repositoryName=proj_name)\n", "workshop.delete_codecommit_repo(proj_name)\n", "workshop.delete_ecr_repo(proj_name)\n", "workshop.delete_service_role_with_policies(codepipeline_service_role_name, codepipeline_policies )\n", "workshop.delete_service_role_with_policies(codebuild_service_role_name, codebuild_policies )\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cleanup S3 bucket\n", "workshop.delete_bucket_with_version(bucket)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Release EKS resources\n", "!kubectl delete -f sratool-deploy.yaml " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " resp = 
iam_client.delete_role_policy(RoleName=ROLE_NAME, PolicyName='S3AccessPolicy')\n", "except Exception:\n", " print(\"Policy might have been deleted already. Ignore\")\n", "!eksctl delete cluster -f eksworkshop.yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }