{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Single Objective Optimization with Credit Risk Analysis " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "#Cell magic to hide outputs \n", "\n", "#Import dependencies\n", "from io import StringIO\n", "import os\n", "import time\n", "import sys\n", "import IPython\n", "from time import gmtime, strftime\n", "\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "import urllib\n", "\n", "import sagemaker\n", "from sagemaker.s3 import S3Uploader\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.xgboost import XGBoost\n", "from sagemaker.s3 import S3Downloader\n", "from sagemaker import Session\n", "from sagemaker import get_execution_role\n", "from sagemaker.xgboost import XGBoostModel\n", "from sagemaker.sklearn import SKLearnModel\n", "from sagemaker.pipeline import PipelineModel\n", "\n", "\n", "session = Session()\n", "bucket = session.default_bucket()\n", "prefix = \"sagemaker/sagemaker-amt-credit-risk-model\"\n", "region = session.boto_region_name\n", "\n", "# Define IAM role\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Train XGBoost Model\n", "In this step, we will train an XGBoost model on the preprocessed data. We will use our own training script with the built-in XGBoost container provided by SageMaker.\n", "\n", "Alternatively, for your own use case, you can also bring your own model (trained elsewhere) to SageMaker for processing with SageMaker Clarify.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up XGBoost Estimator\n", "\n", "Next, let us set up: \n", " 1. 
Pre-defined values for Hyperparameters for XGBoost algorithm\n", " 1. XGBoost Estimator for SageMaker\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"max_depth\": \"5\",\n", " \"eta\": \"0.1\",\n", " \"gamma\": \"4\",\n", " \"min_child_weight\": \"6\",\n", " \"silent\": \"1\",\n", " \"objective\": \"binary:logistic\",\n", " \"num_round\": \"100\",\n", " \"subsample\": \"0.8\",\n", " \"eval_metric\": \"auc\",\n", " \"early_stopping_rounds\": \"20\",\n", " \"output_data_dir\": \"/opt/ml/output/data/\",\n", "}\n", "\n", "entry_point = \"train_xg_amt_single.py\"\n", "source_dir = \"training/\"\n", "output_path = \"s3://{0}/{1}/{2}\".format(bucket, prefix, \"xgb_model\")\n", "code_location = \"s3://{0}/{1}/code\".format(bucket, prefix)\n", "\n", "estimator = XGBoost(\n", " entry_point=entry_point,\n", " source_dir=source_dir,\n", " output_path=output_path,\n", " code_location=code_location,\n", " hyperparameters=hyperparameters,\n", " instance_type=\"ml.c5.xlarge\",\n", " instance_count=1,\n", " framework_version=\"0.90-2\",\n", " py_version=\"py3\",\n", " role=role,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### SageMaker AMT (HPO)\n", "\n", "Now it's time to run the HPO job to train and find the best model \n", "\n", "#### NOTE: THIS CELL WILL RUN FOR APPROX. 5-8 MINUTES! PLEASE BE PATIENT." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#Retrieve variables stored in first notebook \n", "%store -r train_data_path\n", "%store -r val_data_path\n", "%store -r model_path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No finished training job found associated with this estimator. 
Please make sure this estimator is only used for building workflow config\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "...................................................................................!\n", "!\n" ] } ], "source": [ "# Automatic Model Tuning (HPO)\n", "from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n", "\n", "hyperparameter_ranges = {'eta': ContinuousParameter(1e-3, 1, 'Logarithmic'),\n", "                         'min_child_weight': ContinuousParameter(1e-6, 10, 'Logarithmic'),\n", "                         'gamma': ContinuousParameter(1e-6, 10, 'Logarithmic'),\n", "                         'max_depth': IntegerParameter(2, 10, 'Linear')}\n", "\n", "objective_metric_name = 'validation:auc'\n", "\n", "tuner = HyperparameterTuner(estimator,\n", "                            objective_metric_name,\n", "                            hyperparameter_ranges,\n", "                            max_jobs=100, # total number of training jobs\n", "                            max_parallel_jobs=10, # training jobs run at the same time\n", "                            )\n", "\n", "tuning_job_name = \"xgb-tuner-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "\n", "inputs = {'train': train_data_path, 'validation': val_data_path}\n", "\n", "tuner.fit(inputs, job_name=tuning_job_name)\n", "\n", "tuner.wait()\n", "\n", "tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The best Area Under the Curve (AUC) score and the accompanying Disparate Impact (DI) are saved in the SageMaker logs. You can find the logs by following the image guide below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the SageMaker console, use the left menu to navigate to Training -> Hyperparameter Tuning Jobs. This lists all recent tuning jobs; click the latest one. \n", "![Step1](images/Step1A.jpg)\n", "\n", "Scrolling down the tuning job page will bring you to “Best training job summary”. Here you can see the name of the best training job and the optimized value of the metric specified. 
In this case you can see that the AUC value is optimized at 0.91. Click on the best training job name.\n", "![Step2](images/Step2A.jpg)\n", "\n", "Scrolling down the page of the best training job, you will find a section under “Monitor” called “View logs”. Click on this to open the CloudWatch logs.\n", "![Step3](images/Step3A.jpg)\n", "\n", "Scrolling down this page, you will find a section called “Log streams”. Click the name of your training job under this section.\n", "![Step4](images/Step4A.jpg)\n", "\n", "On this page, scroll all the way down to view all of the values the model has tried; the final values at the bottom are the optimized ones. \n", "![Step5](images/Step5A.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As can be seen, we have a very high AUC score because the tuning job optimized for it. The tuning job was not optimized for DI, therefore the DI value is not the best (lowest) it could be. Please refer back to these values, as you will need them to complete the lab in the next notebook. 
\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Best AUC score for single-metric optimization \n", "auc_score = 0.74\n", "#Accompanying Disparate Impact\n", "dppl = 0.75" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now create a model endpoint to make predictions based on the best training job we have obtained." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'s3://sagemaker-us-east-2-921553072635/sagemaker/sagemaker-amt-credit-risk-model/data/preprocessed/val/val_features.csv'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Get Processed Validation Dataset\n", "n = val_data_path+\"val_features.csv\"\n", "test_dataset = S3Downloader.read_file(n)\n", "n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44154\n", "221\n" ] } ], "source": [ "print(len(test_dataset))\n", "test_dataset = test_dataset.split(\"\\n\")\n", "print(len(test_dataset))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "for i in range(len(test_dataset)-1):\n", "    test_dataset[i] = np.fromstring(test_dataset[i], dtype=float, sep=',')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "0\n" ] } ], "source": [ "count0 = 0\n", "count1 = 0\n", "for i in range(len(test_dataset)-1):\n", "    if test_dataset[i][40] == 0:\n", "        count0=count0+1\n", "    if test_dataset[i][40] == 1:\n", "        count1=count1+1\n", "print(count0)\n", "print(count1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'xgb-tuner-26-17-19-04-033-b11b1c0f'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#TESTING ABOVE\n", 
"tuner.best_training_job()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "xgboost_model_data = (\n", " \"s3://{}/{}/{}/{}\".format(bucket, prefix, \"xgb_model\", tuner.best_training_job()) + \"/output/model.tar.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "#TESTING\n", "xgboost_inference_code_location = \"s3://{}/{}/{}/code\".format(bucket, prefix, \"xgb_model\")\n", "\n", "xgboost_model = XGBoostModel(\n", " name=\"xgb-model-{0}\".format(str(int(time.time()))),\n", " model_data=xgboost_model_data,\n", " entry_point=\"inference.py\",\n", " source_dir=\"inference/xgboost/\",\n", " code_location=xgboost_inference_code_location,\n", " framework_version=\"0.90-2\",\n", " py_version=\"py3\",\n", " role=role,\n", " sagemaker_session=session,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EndpointName = SINGLE-2023-01-26-1726\n" ] } ], "source": [ "from datetime import datetime\n", "\n", "endpoint_name = f\"SINGLE-{datetime.utcnow():%Y-%m-%d-%H%M}\"\n", "print(\"EndpointName =\", endpoint_name)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-----!" 
] } ], "source": [ "#Create Endpoint Instance\n", "\n", "#Instance Count\n", "initial_instance_count=1 # Example\n", "#instance Type\n", "instance_type='ml.m5.large' # Example\n", "\n", "FinalModel = xgboost_model.deploy(\n", " initial_instance_count=initial_instance_count,\n", " instance_type=instance_type,\n", " endpoint_name=endpoint_name\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "#Set up predictor \n", "from sagemaker.serializers import CSVSerializer\n", "predictor = sagemaker.predictor.Predictor(\n", " endpoint_name,\n", " session,\n", " serializer=sagemaker.serializers.CSVSerializer(),\n", " deserializer=sagemaker.deserializers.CSVDeserializer(),\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "#Make Predictions on the test dataset\n", "Single_Optimized_Predictions = []\n", "for i in range(len(test_dataset)-1):\n", " Single_Optimized_Predictions.append(predictor.predict(test_dataset[i]))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['0.7337338']]\n", "[['0.94518006']]\n", "[['0.9456592']]\n" ] } ], "source": [ "#See first 3 predictions and how they look. 
Each prediction is wrapped in double brackets (a nested list holding a single value), so we will flatten them to single values in the next cell \n", "print(Single_Optimized_Predictions[0])\n", "print(Single_Optimized_Predictions[1])\n", "print(Single_Optimized_Predictions[2])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "#Turn each item into a single value in the list \n", "Predictions = []\n", "for i in range(len(Single_Optimized_Predictions)):\n", "    Predictions.append(Single_Optimized_Predictions[i][0][0])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7337338\n", "0.94518006\n", "0.9456592\n" ] } ], "source": [ "#Look at the first 3 processed predictions. They are now in the format we want\n", "print(Predictions[0])\n", "print(Predictions[1])\n", "print(Predictions[2])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "#Each prediction value is between 0 - 1. 
But the target variable \"Credit Risk\" is either bad (0) or good (1), therefore we round each value at the 0.5 threshold so that every value is either 0 or 1\n", "for i in range(len(Predictions)):\n", "    if (float(Predictions[i]) >= float(0.5)):\n", "        Predictions[i] = 1 \n", "    elif (float(Predictions[i]) < float(0.5)):\n", "        Predictions[i] = 0" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "1\n", "1\n" ] } ], "source": [ "#Look at the first 3 predictions to confirm they are now either 0 or 1 \n", "print(Predictions[0])\n", "print(Predictions[1])\n", "print(Predictions[2])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "68\n", "152\n" ] } ], "source": [ "#Check how many predictions were 0 (Bad Credit) and 1 (Good Credit)\n", "print(Predictions.count(0))\n", "print(Predictions.count(1))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Graph the predictions as a pie graph\n", "from matplotlib import pyplot as plt\n", "\n", "slices = [Predictions.count(0),Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Single Optimized Model - All\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_dataset[0][-1]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Now let us take a look at Foreign worker predictions only. 
The last element in the list represents Foreign workers (1 means foreign, 2 means non-foreign)\n", "filtered_dataset=[]\n", "for i in range(len(test_dataset)-1):\n", "    if test_dataset[i][-1] == 1:\n", "        filtered_dataset.append(test_dataset[i])\n", "len(filtered_dataset)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "#Make Predictions on the filtered test dataset with only Foreign Workers\n", "Single_F_Predictions = []\n", "for i in range(len(filtered_dataset)):\n", "    Single_F_Predictions.append(predictor.predict(filtered_dataset[i]))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "#Turn each item into a single value in the list \n", "for i in range(len(Single_F_Predictions)):\n", "    Single_F_Predictions[i]=(Single_F_Predictions[i][0][0])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['0.2862774',\n", " '0.06912269',\n", " '0.29497668',\n", " '0.075319745',\n", " '0.24426395',\n", " '0.036359027',\n", " '0.02628616',\n", " '0.12667994',\n", " '0.030781768',\n", " '0.02682332',\n", " '0.2110773',\n", " '0.19238576',\n", " '0.22990663',\n", " '0.081512764',\n", " '0.3376731',\n", " '0.028388042',\n", " '0.061994657',\n", " '0.8415497',\n", " '0.37680793',\n", " '0.019338168',\n", " '0.48963562',\n", " '0.1564905',\n", " '0.015742503',\n", " '0.14973402',\n", " '0.118518226',\n", " '0.21958497',\n", " '0.053401366',\n", " '0.0205502',\n", " '0.7068394',\n", " '0.07206859',\n", " '0.06789858',\n", " '0.82920635',\n", " '0.72153616',\n", " '0.13628021',\n", " '0.011861026',\n", " '0.03210104']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Single_F_Predictions" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "#Each prediction value is between 0 - 1. 
But the target variable \"Credit Risk\" is either bad (0) or good (1), therefore we round each value at the 0.5 threshold so that every value is either 0 or 1\n", "for i in range(len(Single_F_Predictions)):\n", "    if (float(Single_F_Predictions[i]) >= float(0.5)):\n", "        Single_F_Predictions[i] = 1 \n", "    elif (float(Single_F_Predictions[i]) < float(0.5)):\n", "        Single_F_Predictions[i] = 0" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "32\n", "4\n" ] } ], "source": [ "#Check how many predictions were 0 (Bad Credit) and 1 (Good Credit)\n", "print(Single_F_Predictions.count(0))\n", "print(Single_F_Predictions.count(1))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Pie Graph\n", "slices = [Single_F_Predictions.count(0),Single_F_Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Single Optimized Model - Foreign\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'s3://sagemaker-us-east-2-921553072635/sagemaker/sagemaker-amt-credit-risk-model/data/preprocessed/val/val_labels.csv'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read labels of validation set\n", "m = val_data_path+\"val_labels.csv\"\n", "test_labels = S3Downloader.read_file(m)\n", "m" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "test_labels = test_labels.split(\"\\n\")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "221\n" ] }, { "data": { "text/plain": [ "''" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#See length of test_labels\n", "print(len(test_labels))\n", "#Take out last element in test labels since it is an empty quote\n", "test_labels.pop()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100\n", "120\n" ] } ], "source": [ "#See ratio of labels\n", "print(test_labels.count('0'))\n", "print(test_labels.count('1'))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "36\n", "148\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#TESTING - Non Foreign Workers\n", "filtered_dataset=[]\n", "for i in range(len(test_dataset)-1):\n", " if test_dataset[i][-1] == 2:\n", " filtered_dataset.append(test_dataset[i])\n", "len(filtered_dataset)\n", "\n", "Single_F_Predictions = []\n", "for i in range(len(filtered_dataset)):\n", " Single_F_Predictions.append(predictor.predict(filtered_dataset[i]))\n", " \n", "for i in range(len(Single_F_Predictions)):\n", " Single_F_Predictions[i]=(Single_F_Predictions[i][0][0])\n", " \n", "for i in range(len(Single_F_Predictions)):\n", " if (float(Single_F_Predictions[i]) >= float(0.5)):\n", " Single_F_Predictions[i] = 1 \n", " elif (float(Single_F_Predictions[i]) < float(0.5)):\n", " Single_F_Predictions[i] = 0\n", "\n", "print(Single_F_Predictions.count(0))\n", "print(Single_F_Predictions.count(1))\n", "\n", "#Pie Graph\n", "slices = [Single_F_Predictions.count(0),Single_F_Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Single Optimized Model - Non-Foreign\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "#CLEAN UP - Uncomment code in this cell to Spin Down Endpoint\n", "\n", "# Create a low-level SageMaker service client.\n", "#sagemaker_client = boto3.client('sagemaker', region_name=\"us-east-2\")\n", "\n", "# Delete endpoint\n", "#sagemaker_client.delete_endpoint(EndpointName=endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### You may now move onto \"3-Multi-Objective-AMT.ipynb\" to explore Multi Objective Optimization tuning jobs to optimize both AUC & DI" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 
2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }