{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab: Bring your own script Challenge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle a classification problem with scikitlearn: `\"skLearn-Local Notebook.ipynb\"`.\n", "\n", "It works OK with the simple IRIS data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.\n", "\n", "Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?\n", "\n", "## Getting Started\n", "\n", "First, check you can run the **sklearn-Local Notebook.ipynb** notebook through - reviewing what steps it takes.\n", "\n", "This notebook sets out a structure you can use to migrate code into, and lists out some of the changes you'll need to make at a high level. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.\n", "\n", "Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). The goal is to understand the big picture on how you can bring your own code to SageMaker and scale your training and deploy. You can always have more advance models and more complex training code. \n", "\n", "The excercise of bringing your own training code to SageMaker is what we call ***'Script Mode'***. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sklearn script mode training and serving\n", "Script mode is a training script format for a number of supported frameworks that lets you execute the training script in SageMaker with minimal modification (read more details in this blog [Script mode](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/)). The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native SKlearn support sets up training-related environment variables and executes your training script. Script mode supports training with a Python script, a Python module, or a shell script. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dependencies\n", "Listing all our imports at the start helps to keep the requirements to run any script/file transparent up-front, and is specified by nearly every style guide including Python's official [PEP 8](https://peps.python.org/pep-0008/#imports)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "import numpy as np\n", "\n", "# TODO: What else will you need?\n", "# Have a look at the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html\n", "# to see which libraries need to be imported to use sagemaker and the Sklearn estimator estimator\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the Data\n", "We download the Iris data from UCI Machine Learning repository directly from the web. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare the Data\n", "We download the Iris data set directly from the UCI Machine Learning Repository, using the same URL as in the \"sklearn-Local Notebook.ipynb\" notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Download the data from the internet\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Read in the data with the headers\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Split the data into train and test sets\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up the environment: Execution Role, Session and S3 Bucket\n", "Now that we have downloaded and prepared the data in the local directory, we will need to upload it to Amazon S3 to make it available for Amazon SageMaker training.\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.\n", "\n", "- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the `get_execution_role` method from the SageMaker Python SDK." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: This is where you set up the execution role, session and S3 bucket.\n", "\n", "# Define the SageMaker role\n", "\n", "# Define the SageMaker session\n", "\n", "# Define the default bucket\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload Data to Amazon S3\n", "Next you need to upload the data to Amazon S3 for SageMaker training. You can refer to the previous example on how to do it using the `aws s3 sync` CLI command or the boto3 SDK. The high-level `aws s3 sync` command synchronizes the contents of the target bucket and source directory. It allows the use of options such as `--delete`, which removes objects from the target that are not present in the source, and the `--exclude` / `--include` options, which filter the files or objects to transfer.\n", "\n", "⏰ Note: The Iris data set is small, so the upload to Amazon S3 should only take a few seconds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Import the boto3 library\n", "\n", "\n", "# TODO: Convert the train and test sets to CSV files\n", "\n", "\n", "# TODO: Upload the data to your SageMaker default S3 bucket, in a folder called 'training', with the file named 'data.csv'\n", "\n" ] },
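{ "cell_type": "markdown", "metadata": {}, "source": [ "If you get stuck, here is a minimal sketch covering the two TODOs above. It assumes the split step produced DataFrames named `train_data` and `test_data` (rename to match your own variables), and the file and prefix names are illustrative, so align them with the TODO instructions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch (assumes DataFrames `train_data` and `test_data` from the split step)\n", "import sagemaker\n", "\n", "session = sagemaker.Session()  # current SageMaker session\n", "role = sagemaker.get_execution_role()  # IAM role attached to this notebook\n", "bucket_name = session.default_bucket()  # default bucket, created if it doesn't exist\n", "\n", "# Write local CSV copies (file names here are illustrative)\n", "train_data.to_csv(\"train_data.csv\", index=False)\n", "test_data.to_csv(\"test_data.csv\", index=False)\n", "\n", "# upload_data() copies a local file to s3://<bucket>/<key_prefix>/ and returns the S3 URI\n", "train_uri = session.upload_data(\"train_data.csv\", bucket=bucket_name, key_prefix=\"training\")\n", "test_uri = session.upload_data(\"test_data.csv\", bucket=bucket_name, key_prefix=\"testing\")\n", "print(train_uri, test_uri)\n" ] },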
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Define your 2 data channels (train and test)\n", "# The data can be found in: \"s3://{bucket_name}/mnist/training\" and \"s3://{bucket_name}/mnist/testing\"\n", "# We can use either the s3_input (which gives us additional configuration options), or a plain string:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Algorithm (\"Estimator\") Configuration and Run\n", "Instead of loading and fitting this data here in the notebook, we'll be creating a Sklearn Estimator through the SageMaker SDK, to run the code on a separate container that can be scaled as required.\n", "\n", "The [\"Using SKlearn with the SageMaker Python SDK\"](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-scikit-learn-with-the-sagemaker-python-sdk) docs give a good overview of this process. You should run your estimator in script mode (which is easier to follow than the old default legacy mode) and as Python 3.\n", "\n", "## Use the `**main.py**` file already prepared for you in your local directory as your entry point to port code into - which has already been created for you with some basic hints.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:define your estimator using SKlearn framework\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Before running the actual training on SageMaker TrainingJob, it can be good to run it locally first using the code below. If there is any error, you can fix them first before running using SageMaker TrainingJob." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python3 ./main.py --train ./ --test ./ --model-dir ./ --n_estimators=100 --min_samples_leaf=3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Calling `fit`\n", "When you're ready to try your script in a SageMaker training job, you can call estimator.fit() as we did in previous exercises:To start a training job, we call `estimator.fit(training_data_uri)`.\n", "\n", "When training is complete, the training job will upload the saved model to S3 for deployment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:call the fit function and pass on your data you uploaded to S3 above for the training to start\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy and Use Your Model (Real-Time Inference)\n", "We are now ready to deploy our model to Sagemaker hosting services and make real time predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:deploy the model to a real time endpoint\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now send some data to our model to predict- the data shouldbe sent in the accepted format (The data sent to the endpoint for this model should be 'text.csv' format) and the code below just does that. We also ensure to perform the same processing on our test, same as what we did on our training data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:now get some test data to test your model and process them similar to our training set\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:the body that you send to your model enpoint should be text/csv format, get your data to the right format before sending it to you model endpoint for prediciton, each observation should be placed on a new line\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO:now envoke your endpoint and get predictions\n", "\n", "\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-2:452832661640:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }