{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab Environment for BYOA Pipeline\n", "\n", "This notebook instance will act as the lab environment for setting up and triggering changes to our pipeline. This is being used to provide a consistent environment, gain some familiarity with Amazon SageMaker Notebook Instances, and to avoid any issues with debugging individual laptop configurations during the workshop. \n", "\n", "PLEASE review the [sample notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb) for detailed documentation on the model being built\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: View the Data\n", "\n", " In the code cell below, we'll take a look at the training/test/validation datasets and then upload them to S3. Just as the sample notebook referenced above, For the purposes of this example, we're using some the classic [Iris Flower](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset, which we have included in the local notebook instance under the ./data folder." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Training Data\n", " setosa 5.1 3.5 1.4 0.2\n", "0 setosa 4.9 3.0 1.4 0.2\n", "1 setosa 4.7 3.2 1.3 0.2\n", ".. ... ... ... ... ...\n", "147 virginica 6.2 3.4 5.4 2.3\n", "148 virginica 5.9 3.0 5.1 1.8\n", "\n", "[149 rows x 5 columns]\n", "\n", "Smoke Test Data\n", " setosa 5.0 3.5 1.3 0.3\n", "0 versicolor 5.5 2.6 4.4 1.2\n", "1 virginica 5.8 2.7 5.1 1.9\n", "\n", "Validation Data\n", " setosa 5.2 3.5 1.5 0.2\n", "0 setosa 5.2 3.4 1.4 0.2\n", "1 setosa 4.7 3.2 1.6 0.2\n", ".. ... ... ... ... ...\n", "42 versicolor 5.9 3.2 4.8 1.8\n", "43 versicolor 6.1 2.8 4.0 1.3\n", "\n", "[44 rows x 5 columns]\n" ] } ], "source": [ "import pandas as pd\n", "train_data = pd.read_csv('./data/1-train/train/iris.csv', sep=',')\n", "pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns\n", "pd.set_option('display.max_rows', 5) # Keep the output on one page\n", "print('\\nTraining Data\\n', train_data)\n", "\n", "smoketest_data = pd.read_csv('./data/1-train/smoketest/iris.csv', sep=',')\n", "pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns\n", "pd.set_option('display.max_rows', 5) # Keep the output on one page\n", "print('\\nSmoke Test Data\\n', smoketest_data)\n", "\n", "validation_data = pd.read_csv('./data/1-train/validation/iris.csv', sep=',')\n", "pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns\n", "pd.set_option('display.max_rows', 5) # Keep the output on one page\n", "print('\\nValidation Data\\n', validation_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 2: Upload Data to S3 \n", "\n", "We will utilize this notebook to perform some of the setup that will be required to trigger the first execution of our pipeline. In this second step in our Machine Learning pipeline, we are going to simulate what would typically be the last step in an Analytics pipeline of creating datasets. \n", "\n", "To accomplish this, we will actually be uploading data from our local notebook instance (data can be found under /data/1-train/*) to S3. In a typical scenario, this would be done through your analytics pipeline. We will use the S3 bucket that was created through the CloudFormation template we launched at the beginning of the lab. You can validate the S3 bucket exists by:\n", " 1. Going to the [S3 Service](https://s3.console.aws.amazon.com/s3/) inside the AWS Console\n", " 2. Find the name of the S3 data bucket created by the CloudFormation template: mlops-data-*yourintials*-*randomid*\n", " \n", "In the code cell below, we'll take a look at the training/test/validation datasets and then upload them to S3. \n", "\n", " ### UPDATE THE BUCKET NAME BELOW BEFORE EXECUTING THE CELL" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import re\n", "import time\n", "\n", "# UPDATE THE NAME OF THE BUCKET TO MATCH THE ONE WE CREATED THROUGH THE CLOUDFORMATION TEMPLATE\n", "# Example: mlops-data-jdd-df4d4850\n", "#bucket = 'mlops-data--'\n", "bucket = 'mlops-data-jdd-d4d740c0'\n", "\n", "\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "region = boto3.Session().region_name\n", "\n", "\n", "trainfilename = 'train/train.csv'\n", "smoketestfilename = 'smoketest/smoketest.csv'\n", "validationfilename = 'validation/validation.csv'\n", "\n", "\n", "s3 = boto3.resource('s3')\n", "\n", "s3.meta.client.upload_file('./data/1-train/train/iris.csv', bucket, trainfilename)\n", "s3.meta.client.upload_file('./data/1-train/smoketest/iris.csv', bucket, smoketestfilename)\n", "s3.meta.client.upload_file('./data/1-train/validation/iris.csv', bucket, validationfilename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Step 3: Commit Training Code To Trigger Pipeline Build\n", "\n", "In this step, we are going to trigger an execution of the pipeline by committing our training code to the CodeCommit repository that was setup as part of the CloudFormation stack. The CodeCommit repository created was associated with this SageMaker Notebook Instance via a setting in the CloudFormation Stack using the [SageMaker Notebook Instance Git Association](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-git-repo.html) feature.\n", "\n", "The pipeline is currently setup to trigger on a commit to the master branch; however, this should be adjusted in a real-world scenario based on your branching strategy. \n", "\n", "The CodeCommit repository created can be viewed by:\n", " 1. Going to the [CodeCommit Service](https://console.aws.amazon.com/codesuite/codecommit/repositories) inside the AWS Console\n", " 2. Find the name of the repository created by the CloudFormation template: mlops-codecommit-byo-*yourinitials*\n", " \n", "**UPDATE** Ensure you update the cell below where noted prior to executing " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "origin\thttps://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd (fetch)\r\n", "origin\thttps://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd (push)\r\n" ] } ], "source": [ "# View the CodeCommit repository -\n", "# This Git integration was configured as part of the creation of the notebook instance in the CloudFormation stack.\n", "\n", "# The following will return the CodeCommit repository that has been configured with this notebook and will be used \n", "# for the source control repository during this workshop. \n", "\n", "# Ensure remote repo is setup\n", "!git remote -v\n", "\n", "!git config --global user.name \"Jane Smith\"\n", "!git config --global user.email JaneSmith@example.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Commit training code to the CodeCommit repository to trigger the execution of the CodePipeline" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Already up-to-date.\n", "[master ebb3723] Initial add of model code to CodeCommit Repo\n", " 2 files changed, 2 insertions(+), 9 deletions(-)\n", "Counting objects: 6, done.\n", "Delta compression using up to 2 threads.\n", "Compressing objects: 100% (5/5), done.\n", "Writing objects: 100% (6/6), 502 bytes | 502.00 KiB/s, done.\n", "Total 6 (delta 4), reused 0 (delta 0)\n", "To https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mlops-codecommit-byo-jdd\n", " 7594aae..ebb3723 master -> master\n" ] } ], "source": [ "!git pull\n", "!git add ./model-code/*\n", "!git commit -m \"Initial add of model code to CodeCommit Repo\"\n", "!git push origin master" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## Step 4: Monitor CodePipeline Execution\n", "\n", "The code above will trigger the execution of your CodePipeline. You can monitor progress of the pipeline execution in the [CodePipeline dashboard](https://console.aws.amazon.com/codesuite/codepipeline/pipelines).\n", "\n", "You can also validate that your code is now committed to the CodeCommit repository in the [CodeCommit dashboard](https://console.aws.amazon.com/codesuite/codecommit/repositories)\n", "\n", "As the pipeline is executing information is being logged to [Cloudwatch logs](https://console.aws.amazon.com/cloudwatch/logs). Explore the logs for your Lambda functions (/aws/lambda/MLOps-BYO*) as well as output logs from SageMaker (/aws/sagemaker/*)\n", "\n", "\n", "Note: It will take awhile to execute all the way through the pipeline. Please don't proceed to the next step until the last stage is shows **'succeeded'**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--- \n", "\n", "## Step 5: Edit CodePipeline Triggers to add model retraining\n", "\n", "In the steps above, we demonstrated how you can trigger the pipeline when new code is committed to CodeCommit. As you continue to modify your training/inference code, you essentially re-execute the cell above to commit new code and trigger another execution of the pipeline. Although we are using a notebook instance for the purposes of the workshop, the commit to a source code repository above can happen in your local environment and/or IDEs of choice. \n", "\n", "In this step, we want to modify the pipeline to add the capability to not only trigger based off code changes but to also trigger a retraining cycle in the event of receiving new training data. If the file/object containing the training data is a single object that will be inclusive of all training data you want to use, you can trigger CodePipeline based on the object itself. However, if you are looking to training incrementally or if your analytics pipeline puts new data to your S3 bucket as deltas then you need to have a trigger mechanism included in your analytics pipeline that notifies CodePipeline it is time to retrain. \n", "\n", "To simplify the setup for this workshop, we are going to trigger based on a new training dataset that is inclusive of all the data we want to retrain with. \n", "\n", " 1. Go to your [CodePipeline Pipeline](https://console.aws.amazon.com/codesuite/codepipeline/pipelines) \n", "\n", " 2. Click on the link to your pipeline (i.e. MLOps-BYO-BuildPipeline*)\n", " \n", " 3. Click the **Edit** Button \n", " \n", " 4. Inside **Edit:Source** , click the **Edit stage** button --> then Click **Add Action**\n", " \n", " 5. Under Edit Action:\n", " \n", " * **Action Name** : RetrainData\n", " \n", " * **Action Provider** : Source - S3\n", " \n", " * **Bucket** : *Enter the name of your S3 data bucket* Example: mlops-data-jdd-d4d740c0\n", " \n", " * **S3 object key** : train/train.csv\n", " \n", " * **Output Artifacts** : RetrainDataIn\n", " \n", " \n", " 6. Validate your screen contains all information as shown below: \n", " \n", " \n", " \n", " 7. Click **Done**\n", " \n", " 8. Click the orange **Save** button in the upper right hand corner to save your changes, confirm changes and hit **Save** again." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we may not want rebuild the training/inference container image in the case where we only want to retrain the model, you could optionally create a separate retraining pipeline that excludes the rebuild of the image. Depending on what you are using for orchestration across your pipeline, you can accomplish this through a single or multiple pipelines. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Step 6: Trigger retraining based on new data\n", " \n", "Let's test our new CodePipeline trigger by adding new training data to our S3 data bucket. The S3 data bucket is setup with versioning enabled. " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "trainfilename = 'train/train.csv'\n", "\n", "s3 = boto3.resource('s3')\n", "\n", "s3.meta.client.upload_file('./data/1-train/train/iris-2.csv', bucket, trainfilename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After executing the cell above, you will see a new trigger for pipeline execution when you click on the link to your [pipeline](https://console.aws.amazon.com/codesuite/codepipeline/pipelines)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Step 7: Clean-Up\n", "\n", "\n", "Return to the [README.md](https://github.com/aws-samples/amazon-sagemaker-devops-with-ml/2-Bring-Your-Own/README.md) to complete the environment cleanup instructions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# CONGRATULATIONS! \n", "\n", "You've built a basic pipeline for the use case of bringing your own algorithm/training/inference code to SageMaker. This pipeline can act as a starting point for building in additional quality gates such as container scans, manual approvals, and additional evaluation/logging capabilities. Another common extension to the pipeline may be creating/updating your API serving predictions through API Gateway. " ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }