{ "cells": [ { "cell_type": "markdown", "id": "1b84c609", "metadata": {}, "source": [ "# Use SKLearn and Amazon SageMaker Clarify\n", "_**Run Amazon SageMaker Clarify processing after you have trained a model**_\n", "\n", "---\n", "\n", "Take introduction from here:\n", "\n", "https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/linear_learner_abalone/Linear_Learner_Regression_csv_format.ipynb\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "2. [Setup](#Setup)\n", " 1. [Source the libraries](#Source-the-libraries)\n", " 2. [Set S3 bucket and data prefix](#Set-S3-bucket-and-data-prefix)\n", " 3. [Set role and global vars](#Set-role-and-global-vars)\n", "3. [Load the data](#Load-the-data)\n", "4. [Upload the data to S3](#Upload-the-data-to-S3)\n", "5. [Train a SKLearn estimator](#Train-a-SKLearn-estimator)\n", "6. [Amazon SageMaker Clarify](#Amazon-SageMaker-Clarify)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "4e96395a", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook demonstrates how to train a regression model with the Amazon SageMaker SKLearn estimator and then run Amazon SageMaker Clarify processing on the trained model.\n", "\n", "We use the [Abalone data](https://datahub.io/machine-learning/abalone), originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names).\n", "\n", "---\n", "## Setup\n", "\n", "This notebook was tested in an Amazon SageMaker notebook on an ml.t3.medium instance with the Python 3 (conda_python3) kernel.\n", "\n", "Let's start by specifying:\n", "1. The libraries to source\n", "2. The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.\n", "3. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. 
Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).\n", "4. The global variables used later for training the model" ] }, { "cell_type": "markdown", "id": "ad4680fa", "metadata": {}, "source": [ "### Source the libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "ee42db9f", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.sklearn.estimator import SKLearn\n", "import pandas as pd\n", "import numpy as np\n", "import urllib\n", "import boto3\n", "import json\n", "import os\n", "import time\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "d1c2010f", "metadata": {}, "source": [ "### Set S3 bucket and data prefix" ] }, { "cell_type": "code", "execution_count": 2, "id": "ccc37199", "metadata": {}, "outputs": [], "source": [ "# Specify where the training and validation data will be uploaded\n", "S3_BUCKET = 'sagemaker-clarify-demo' # YOUR_S3_BUCKET\n", "PREFIX = 'abalone-clarify-notebook'\n", "DATA_PREFIX = f'{PREFIX}/prepared_data'" ] }, { "cell_type": "markdown", "id": "e64455c5", "metadata": {}, "source": [ "### Set role and global vars" ] }, { "cell_type": "code", "execution_count": 3, "id": "f8fa69b3", "metadata": {}, "outputs": [], "source": [ "# Get the SageMaker session, its region, and a SageMaker-compatible execution role\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "role = get_execution_role()\n", "\n", "# Set the scikit-learn framework version, instance type, and instance count\n", "framework_version = '0.23-1'\n", "instance_type = 'ml.m5.xlarge'\n", "instance_count = 1\n", "\n", "# Set your code folder and entry-point script\n", "source_dir = 'model/'\n", "entry_point = 'predictor.py'" ] }, { "cell_type": "markdown", "id": "646a34e6", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "markdown", "id": 
"aa0f8d52", "metadata": {}, "source": [ "Read the dataset from the public datahub.io URL and rename the target column to `Rings`." ] }, { "cell_type": "code", "execution_count": 4, "id": "98d2b703", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://datahub.io/machine-learning/abalone/r/abalone.csv')\n", "# Rename any column whose name contains \"rings\" to \"Rings\"\n", "cols = [x if \"rings\" not in x else \"Rings\" for x in df.columns]\n", "df.columns = cols" ] }, { "cell_type": "code", "execution_count": 5, "id": "eba9d64f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 \n", " | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | \n", "
---|---|---|---|---|---|---|---|---|---|
0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 | \n", "
1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 | \n", "
2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 | \n", "
3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 | \n", "
4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 | \n", "