{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Multi-Objective Optimization with Credit Risk Analysis\n", "Here we optimize our model for two objectives at once: performance, measured by Area Under the Curve (AUC), and fairness, measured by Statistical Disparity (SD)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# Cell magic to hide outputs\n", "\n", "# Import dependencies\n", "from io import StringIO\n", "import os\n", "import time\n", "import sys\n", "import IPython\n", "from time import gmtime, strftime\n", "\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "import urllib\n", "\n", "import sagemaker\n", "from sagemaker.s3 import S3Uploader\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.xgboost import XGBoost\n", "from sagemaker.s3 import S3Downloader\n", "from sagemaker import Session\n", "from sagemaker import get_execution_role\n", "from sagemaker.xgboost import XGBoostModel\n", "from sagemaker.sklearn import SKLearnModel\n", "from sagemaker.pipeline import PipelineModel\n", "\n", "\n", "session = Session()\n", "bucket = session.default_bucket()\n", "prefix = \"sagemaker/sagemaker-amt-credit-risk-model\"\n", "region = session.boto_region_name\n", "\n", "# Define IAM role\n", "role = get_execution_role()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# !pygmentize processing/preprocessor.py" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Train XGBoost Model\n", "In this step, we will train an XGBoost model on the preprocessed data. 
We will use our own training script with the built-in XGBoost container provided by SageMaker.\n", "\n", "Alternatively, for your own use case, you can bring your own model (trained elsewhere) to SageMaker for processing with SageMaker Clarify.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up XGBoost Estimator\n", "\n", "Next, let us set up:\n", " 1. Predefined hyperparameter values for the XGBoost algorithm\n", " 1. The XGBoost estimator for SageMaker\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Static hyperparameters. The tuning job defined later searches over\n", "# eta, gamma, min_child_weight and max_depth instead of using these values.\n", "hyperparameters = {\n", "    \"max_depth\": \"5\",\n", "    \"eta\": \"0.1\",\n", "    \"gamma\": \"4\",\n", "    \"min_child_weight\": \"6\",\n", "    \"silent\": \"1\",\n", "    \"objective\": \"binary:logistic\",\n", "    \"num_round\": \"100\",\n", "    \"subsample\": \"0.8\",\n", "    \"eval_metric\": \"auc\",\n", "    \"early_stopping_rounds\": \"20\",\n", "    \"output_data_dir\": \"/opt/ml/output/data/\",\n", "}\n", "\n", "entry_point = \"train_xg_amt_multi.py\"\n", "source_dir = \"training/\"\n", "output_path = \"s3://{0}/{1}/{2}\".format(bucket, prefix, \"xgb_model\")\n", "code_location = \"s3://{0}/{1}/code\".format(bucket, prefix)\n", "\n", "estimator = XGBoost(\n", "    entry_point=entry_point,\n", "    source_dir=source_dir,\n", "    output_path=output_path,\n", "    code_location=code_location,\n", "    hyperparameters=hyperparameters,\n", "    instance_type=\"ml.c5.xlarge\",\n", "    instance_count=1,\n", "    framework_version=\"0.90-2\",\n", "    py_version=\"py3\",\n", "    role=role,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### SageMaker AMT (HPO)\n", "\n", "Now run the hyperparameter optimization (HPO) job to train candidate models and find the best one.\n", "\n", "#### NOTE: THIS CELL WILL RUN FOR APPROX. 5-8 MINUTES! PLEASE BE PATIENT." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%store -r train_data_path\n", "%store -r val_data_path\n", "%store -r model_path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "...................................................................................................................................................!\n", "!\n" ] } ], "source": [ "# Automatic Model Tuning (HPO)\n", "\n", "# output_data_dir = 's3://sagemaker-us-east-2-921553072635/sagemaker/sagemaker-amt-credit-risk-model/data/output/'\n", "\n", "from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n", "\n", "hyperparameter_ranges = {'eta': ContinuousParameter(1e-3, 1, 'Logarithmic'),\n", " 'min_child_weight': ContinuousParameter(1e-6, 10, 'Logarithmic'),\n", " 'gamma': ContinuousParameter(1e-6, 10, 'Logarithmic'),\n", " 'max_depth': IntegerParameter(2, 10, 'Linear')}\n", "\n", "objective_metric_name = 'validation:auc'\n", "\n", "tuner = HyperparameterTuner(estimator,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " max_jobs=100, # 100\n", " max_parallel_jobs=10, #16\n", " )\n", "\n", "tuning_job_name = \"xgb-tuner-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "\n", "# inputs = {'train': train_data_path, 'validation': val_data_path, 'output_data_dir': output_data_dir}\n", "\n", "inputs = {'train': train_data_path, 'validation': val_data_path}\n", "\n", "tuner.fit(inputs, job_name=tuning_job_name)\n", "\n", "tuner.wait()\n", "\n", "tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "'xgb-tuner-27-00-32-14-052-be514283'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuner.best_training_job()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "xgboost_model_data = (\n", " \"s3://{}/{}/{}/{}\".format(bucket, prefix, \"xgb_model\", tuner.best_training_job()) + \"/output/model.tar.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "xgboost_inference_code_location = \"s3://{}/{}/{}/code\".format(bucket, prefix, \"xgb_model\")\n", "\n", "xgboost_model = XGBoostModel(\n", " name=\"xgb-model-{0}\".format(str(int(time.time()))),\n", " model_data=xgboost_model_data,\n", " entry_point=\"inference.py\",\n", " source_dir=\"inference/xgboost/\",\n", " code_location=xgboost_inference_code_location,\n", " framework_version=\"0.90-2\",\n", " py_version=\"py3\",\n", " role=role,\n", " sagemaker_session=session,\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EndpointName = MULTI-2023-01-27-0044\n" ] } ], "source": [ "from datetime import datetime\n", "\n", "endpoint_name = f\"MULTI-{datetime.utcnow():%Y-%m-%d-%H%M}\"\n", "print(\"EndpointName =\", endpoint_name)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-----!" 
] } ], "source": [ "# Create the endpoint\n", "\n", "# Instance count (example value)\n", "initial_instance_count = 1\n", "# Instance type (example value)\n", "instance_type = 'ml.m5.large'\n", "\n", "FinalModel = xgboost_model.deploy(\n", "    initial_instance_count=initial_instance_count,\n", "    instance_type=instance_type,\n", "    endpoint_name=endpoint_name\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'s3://sagemaker-us-east-2-921553072635/sagemaker/sagemaker-amt-credit-risk-model/data/preprocessed/val/val_features.csv'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the validation features from S3\n", "n = val_data_path+\"val_features.csv\"\n", "test_dataset = S3Downloader.read_file(n)\n", "n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44157\n", "221\n" ] } ], "source": [ "print(len(test_dataset))\n", "test_dataset = test_dataset.split(\"\\n\")\n", "print(len(test_dataset))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Parse each CSV row into a float array. The final element is skipped:\n", "# splitting on newlines leaves an empty string there when the file ends with a newline.\n", "for i in range(len(test_dataset)-1):\n", "    test_dataset[i] = np.fromstring(test_dataset[i], dtype=float, sep=',')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Set up the predictor\n", "from sagemaker.serializers import CSVSerializer\n", "predictor = sagemaker.predictor.Predictor(\n", "    endpoint_name,\n", "    session,\n", "    serializer=sagemaker.serializers.CSVSerializer(),\n", "    deserializer=sagemaker.deserializers.CSVDeserializer(),\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Make predictions on the validation dataset\n", "Multi_Optimized_Predictions = []\n", "for i in range(len(test_dataset)-1):\n", "    Multi_Optimized_Predictions.append(predictor.predict(test_dataset[i]))" ] }, { "cell_type": "code", "execution_count": 16, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['0.9327612']]\n", "[['0.86414945']]\n", "[['0.9440027']]\n" ] } ], "source": [ "# Inspect the first 3 predictions. Each one is a nested list (one row holding one string value); the next cell flattens them.\n", "print(Multi_Optimized_Predictions[0])\n", "print(Multi_Optimized_Predictions[1])\n", "print(Multi_Optimized_Predictions[2])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Flatten each nested prediction into a single value\n", "Predictions = []\n", "for i in range(len(Multi_Optimized_Predictions)):\n", "    Predictions.append(Multi_Optimized_Predictions[i][0][0])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9327612\n", "0.86414945\n", "0.9440027\n" ] } ], "source": [ "# Look at the first 3 processed predictions. They are now in the format we want.\n", "print(Predictions[0])\n", "print(Predictions[1])\n", "print(Predictions[2])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Each prediction value is between 0 and 1. 
But the target variable \"Credit Risk\" is either bad (0) or good (1), so we threshold at 0.5 to make every value 0 or 1.\n", "for i in range(len(Predictions)):\n", "    if float(Predictions[i]) >= 0.5:\n", "        Predictions[i] = 1\n", "    else:\n", "        Predictions[i] = 0" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "1\n", "1\n" ] } ], "source": [ "# Look at the first 3 predictions again to confirm each is now 0 or 1\n", "print(Predictions[0])\n", "print(Predictions[1])\n", "print(Predictions[2])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "71\n", "149\n" ] } ], "source": [ "# Count how many predictions were 0 (Bad Credit) and 1 (Good Credit)\n", "print(Predictions.count(0))\n", "print(Predictions.count(1))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "[figure: pie chart, Credit Risk Ratio For a Multi Optimized Model - All]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the predictions as a pie chart\n", "from matplotlib import pyplot as plt\n", "\n", "slices = [Predictions.count(0), Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Multi Optimized Model - All\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now look at foreign-worker predictions only. The last feature in each row encodes foreign-worker status (1 = foreign, 2 = non-foreign).\n", "filtered_dataset = []\n", "for i in range(len(test_dataset)-1):\n", "    if test_dataset[i][-1] == 1:\n", "        filtered_dataset.append(test_dataset[i])\n", "len(filtered_dataset)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "36" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(filtered_dataset)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Make predictions on the foreign-worker subset\n", "Multi_F_Predictions = []\n", "for i in range(len(filtered_dataset)):\n", "    Multi_F_Predictions.append(predictor.predict(filtered_dataset[i]))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Flatten each nested prediction into a single value\n", "for i in range(len(Multi_F_Predictions)):\n", "    Multi_F_Predictions[i] = Multi_F_Predictions[i][0][0]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['0.24157475',\n", " '4.7756188e-05',\n", " 
'0.06342731',\n", " '0.93585235',\n", " '6.523058e-05',\n", " '0.5261048',\n", " '0.7142568',\n", " '0.014334942',\n", " '0.017389935',\n", " '9.352473e-05',\n", " '0.6630546',\n", " '0.4402347',\n", " '0.044109173',\n", " '0.13703965',\n", " '0.13513479',\n", " '0.09022986',\n", " '0.27975675',\n", " '0.7313284',\n", " '0.8185003',\n", " '0.123835415',\n", " '0.042360604',\n", " '0.000822239',\n", " '0.0046404405',\n", " '0.03956926',\n", " '0.932917',\n", " '0.76213884',\n", " '0.00038311264',\n", " '0.12971416',\n", " '0.90214473',\n", " '0.0013121088',\n", " '0.0013635819',\n", " '0.8794973',\n", " '0.026009575',\n", " '0.08722933',\n", " '0.85275924',\n", " '0.0009555685']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Multi_F_Predictions" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Each prediction value is between 0 and 1. But the target variable \"Credit Risk\" is either bad (0) or good (1), so we threshold at 0.5 to make every value 0 or 1.\n", "for i in range(len(Multi_F_Predictions)):\n", "    if float(Multi_F_Predictions[i]) >= 0.5:\n", "        Multi_F_Predictions[i] = 1\n", "    else:\n", "        Multi_F_Predictions[i] = 0" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25\n", "11\n" ] } ], "source": [ "# Count how many predictions were 0 (Bad Credit) and 1 (Good Credit)\n", "print(Multi_F_Predictions.count(0))\n", "print(Multi_F_Predictions.count(1))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "[figure: pie chart, Credit Risk Ratio For a Multi Optimized Model - Foreign]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Pie chart\n", "slices = [Multi_F_Predictions.count(0), Multi_F_Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Multi Optimized Model - Foreign\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "46\n", "138\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "[figure: pie chart, Credit Risk Ratio For a Multi Optimized Model - Non-Foreign]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# TESTING - non-foreign workers (last feature == 2), scored by the same multi-optimized endpoint\n", "filtered_dataset = []\n", "for i in range(len(test_dataset)-1):\n", "    if test_dataset[i][-1] == 2:\n", "        filtered_dataset.append(test_dataset[i])\n", "\n", "Multi_NF_Predictions = []\n", "for i in range(len(filtered_dataset)):\n", "    Multi_NF_Predictions.append(predictor.predict(filtered_dataset[i]))\n", "\n", "# Flatten each nested prediction into a single value\n", "for i in range(len(Multi_NF_Predictions)):\n", "    Multi_NF_Predictions[i] = Multi_NF_Predictions[i][0][0]\n", "\n", "# Threshold at 0.5 to get a 0/1 label\n", "for i in range(len(Multi_NF_Predictions)):\n", "    if float(Multi_NF_Predictions[i]) >= 0.5:\n", "        Multi_NF_Predictions[i] = 1\n", "    else:\n", "        Multi_NF_Predictions[i] = 0\n", "\n", "print(Multi_NF_Predictions.count(0))\n", "print(Multi_NF_Predictions.count(1))\n", "\n", "# Pie chart\n", "slices = [Multi_NF_Predictions.count(0), Multi_NF_Predictions.count(1)]\n", "labels = ['Bad Credit', 'Good Credit']\n", "colors = ['red', 'green']\n", "explode = [0, 0.1]\n", "plt.pie(slices, labels=labels, autopct='%1.1f%%', explode=explode, colors=colors, wedgeprops={'edgecolor': 'black'})\n", "plt.title(\"Credit Risk Ratio For a Multi Optimized Model - Non-Foreign\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Generate Pareto frontier\n", "def get_pareto_frontier(Xs, Ys, maxX=False, maxY=True):\n", "    \"\"\"Return the Pareto front of the points (Xs[i], Ys[i]).\n", "\n", "    AMT optimizes a single merged metric, so it cannot produce this front\n", "    itself; it is computed here for analysis only.\n", "    \"\"\"\n", "    # Sort by X (ascending unless maxX), then keep points whose Y is no worse than the last kept point\n", "    sorted_list = sorted([[Xs[i], Ys[i]] for i in range(len(Xs))], reverse=maxX)\n", "    pareto_front = [sorted_list[0]]\n", "    for pair in sorted_list[1:]:\n", "        if maxY:\n", "            if pair[1] >= pareto_front[-1][1]:\n", "                pareto_front.append(pair)\n", "        else:\n", "            if pair[1] <= pareto_front[-1][1]:\n", "                pareto_front.append(pair)\n", "\n", "    pf_X = [pair[0] for pair in pareto_front]\n", "    pf_Y = [pair[1] for pair in pareto_front]\n", "    return pf_X, pf_Y" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "s3.ObjectSummary(bucket_name='sagemaker-us-east-2-921553072635', key='sagemaker/sagemaker-amt-credit-risk-model/data/output/tmp/02397abf-output.txt')" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The S3 bucket's tmp folder is populated with results from the training jobs run by HPO (Automatic Model Tuning)\n", "import boto3\n", "import pandas as pd\n", "import io\n", "\n", "s3 = boto3.resource('s3')\n", "# Use a separate name so the `bucket` string defined in the first cell is not overwritten\n", "results_bucket = s3.Bucket(session.default_bucket())\n", "prefix_objs = results_bucket.objects.filter(Prefix=\"sagemaker/sagemaker-amt-credit-risk-model/data/output/tmp/\")\n", "objects = iter(prefix_objs)\n", "# Peek at the first object. This also consumes it, so the loop in the next cell starts at the second object.\n", "next(objects)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "prefix_df = []\n", "\n", "# Read each remaining result file as text; the next cell parses it as comma-separated values\n", "for obj in objects:\n", "    body = obj.get()['Body'].read()\n", "    tmp = str(body, 'UTF-8').strip()\n", "    prefix_df.append(tmp)\n", "\n", "# prefix_df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Split each result into columns: the first is SD, the second is AUC\n", "df = pd.DataFrame(prefix_df)\n", "df['models_list'] = df[0].str.split(',')\n", "df_models = pd.DataFrame(df['models_list'].tolist()).fillna('').add_prefix('model_')\n", "df_models.rename(columns={'model_0': 'sd', 'model_1': 'auc'}, inplace=True)\n", "df_models.to_csv('sd_auc_pareto_frontier.csv', index=False, header=True)\n", "sd_str_list = df_models.iloc[:, 0].tolist()\n", "sd_float_list = [float(x) for x in sd_str_list]\n", "auc_str_list = df_models.iloc[:, -1].tolist()\n", "auc_float_list = [float(x) for x in auc_str_list]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "pf_X, pf_Y = 
get_pareto_frontier(Xs=sd_float_list, Ys=auc_float_list)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Remove every 0 from pf_X, together with the corresponding point in pf_Y\n", "i = 0\n", "while i < len(pf_X):\n", "    if pf_X[i] != 0:\n", "        i += 1\n", "    else:\n", "        del pf_X[i]\n", "        del pf_Y[i]\n", "# Remove every 0 from sd_float_list, together with the corresponding point in auc_float_list\n", "i = 0\n", "while i < len(sd_float_list):\n", "    if sd_float_list[i] != 0:\n", "        i += 1\n", "    else:\n", "        del sd_float_list[i]\n", "        del auc_float_list[i]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Baseline Statistical Disparity: 0.75\n", "Baseline Area Under Curve: 0.74 \n", "\n", "Optimized Statistical Disparity: 0.43\n", "Optimized Area Under Curve: 0.72 \n", "\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "[figure: Pareto Chart, Area Under Curve (AUC) versus Difference in Positive Proportions in Predicted Labels (DPPL)]" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "from matplotlib import pyplot as plt\n", "\n", "# Baseline without optimization. Take these values from the results you found in Notebook 2.\n", "baseline_sd = 0.75\n", "baseline_auc = 0.74\n", "\n", "# Values after optimization. To find them, follow the steps from Notebook 2 for the training job created in this notebook.\n", "optimized_sd = 0.43\n", "optimized_auc = 0.72\n", "#combined = 0.75\n", "fig, ax = plt.subplots()\n", "# Add grid\n", "plt.grid()\n", "\n", "# Title and axis labels for the Pareto chart\n", "plt.title(\"Pareto Chart\")\n", "plt.xlabel(\"Difference in Positive Proportions in Predicted Labels (DPPL)\")\n", "plt.ylabel(\"Area Under Curve (AUC)\")\n", "\n", "# Plot the Pareto front\n", "plt.plot(pf_X, pf_Y, linestyle='--', linewidth=3, marker='o', c='#8B8000', mfc='b')\n", "# Plot all remaining points\n", "plt.plot(sd_float_list, auc_float_list, linestyle='None', marker='o', markersize=3, color='#86C5DA')\n", "# Show baseline SD & AUC\n", "plt.plot(baseline_sd, baseline_auc, \"ro\", markersize=10, label=(\"Baseline (\"+str(baseline_sd)+\", \"+str(baseline_auc)+\")\"))\n", "# Show optimized SD & AUC\n", "plt.plot(optimized_sd, optimized_auc, \"gv\", markersize=10, label=(\"Optimized (\"+str(optimized_sd)+\", \"+str(optimized_auc)+\")\"))\n", "# Legend\n", "plt.legend(loc=\"lower left\")\n", "\n", "# Print baseline values\n", "print(\"Baseline Statistical Disparity: \",baseline_sd)\n", "print(\"Baseline Area Under Curve: \",baseline_auc,\"\\n\")\n", "\n", "# Print optimized values\n", "print(\"Optimized Statistical Disparity: \",optimized_sd)\n", "print(\"Optimized Area Under Curve: \",optimized_auc,\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The cell below provides a range of visualizations for the tuning job above. 
This visualization and its libraries were provided by the following repo:\n", "\n", "https://github.com/aws-samples/amazon-sagemaker-amt-visualize/blob/main/1_tuning_of_builtin_xgboost.ipynb" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting altair renderer to default.\n", "Tuning job xgb-tuner-27-00-32-14 status: Completed\n", "\n", "Number of training jobs with valid objective: 100\n", "Lowest: 0.5139889717102051 Highest 0.6838629841804504\n" ] }, { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>eta</th><th>gamma</th><th>max_depth</th><th>min_child_weight</th><th>TrainingJobName</th><th>TrainingJobStatus</th><th>TrainingStartTime</th><th>TrainingEndTime</th><th>TrainingElapsedTimeSeconds</th><th>TuningJobName</th><th>validation:auc</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>48</th><td>0.548612</td><td>0.012421</td><td>4.0</td><td>0.000066</td><td>xgb-tuner-27-00-32-14-052-be514283</td><td>Completed</td><td>2023-01-27 00:39:38+00:00</td><td>2023-01-27 00:40:47+00:00</td><td>69.0</td><td>xgb-tuner-27-00-32-14</td><td>0.683863</td></tr>\n", "    <tr><th>68</th><td>0.544451</td><td>0.000904</td><td>4.0</td><td>0.000078</td><td>xgb-tuner-27-00-32-14-032-c6499eb4</td><td>Completed</td><td>2023-01-27 00:37:02+00:00</td><td>2023-01-27 00:37:40+00:00</td><td>38.0</td><td>xgb-tuner-27-00-32-14</td><td>0.678610</td></tr>\n", "    <tr><th>71</th><td>0.643851</td><td>0.000088</td><td>4.0</td><td>0.000003</td><td>xgb-tuner-27-00-32-14-029-43d2f514</td><td>Completed</td><td>2023-01-27 00:36:36+00:00</td><td>2023-01-27 00:37:08+00:00</td><td>32.0</td><td>xgb-tuner-27-00-32-14</td><td>0.677474</td></tr>\n", "    <tr><th>86</th><td>0.648377</td><td>0.000098</td><td>4.0</td><td>0.000018</td><td>xgb-tuner-27-00-32-14-014-7c090c65</td><td>Completed</td><td>2023-01-27 00:35:18+00:00</td><td>2023-01-27 00:35:50+00:00</td><td>32.0</td><td>xgb-tuner-27-00-32-14</td><td>0.669266</td></tr>\n", "    <tr><th>81</th><td>0.718554</td><td>0.054962</td><td>4.0</td><td>0.000013</td><td>xgb-tuner-27-00-32-14-019-b6885f1d</td><td>Completed</td><td>2023-01-27 00:35:35+00:00</td><td>2023-01-27 00:36:12+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.667242</td></tr>\n", "    <tr><th>35</th><td>0.524844</td><td>0.066738</td><td>3.0</td><td>0.000008</td><td>xgb-tuner-27-00-32-14-065-f1f06d11</td><td>Completed</td><td>2023-01-27 00:40:21+00:00</td><td>2023-01-27 00:40:58+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.665081</td></tr>\n", "    <tr><th>87</th><td>0.611708</td><td>0.000091</td><td>4.0</td><td>0.004199</td><td>xgb-tuner-27-00-32-14-013-e0d8f46a</td><td>Completed</td><td>2023-01-27 00:35:16+00:00</td><td>2023-01-27 00:35:53+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.663159</td></tr>\n", "    <tr><th>44</th><td>0.730583</td><td>0.002627</td><td>4.0</td><td>0.000487</td><td>xgb-tuner-27-00-32-14-056-250a354c</td><td>Completed</td><td>2023-01-27 00:39:24+00:00</td><td>2023-01-27 00:40:01+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.659589</td></tr>\n", "    <tr><th>88</th><td>0.474216</td><td>0.000016</td><td>3.0</td><td>0.000453</td><td>xgb-tuner-27-00-32-14-012-33b1a582</td><td>Completed</td><td>2023-01-27 00:35:14+00:00</td><td>2023-01-27 00:35:51+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.658692</td></tr>\n", "    <tr><th>40</th><td>0.498717</td><td>0.000230</td><td>5.0</td><td>0.000013</td><td>xgb-tuner-27-00-32-14-060-c903e078</td><td>Completed</td><td>2023-01-27 00:39:45+00:00</td><td>2023-01-27 00:40:22+00:00</td><td>37.0</td><td>xgb-tuner-27-00-32-14</td><td>0.657385</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " eta gamma max_depth min_child_weight TrainingJobName TrainingJobStatus TrainingStartTime TrainingEndTime TrainingElapsedTimeSeconds TuningJobName validation:auc\n", "48 0.548612 0.012421 4.0 0.000066 xgb-tuner-27-00-32-14-052-be514283 Completed 2023-01-27 00:39:38+00:00 2023-01-27 00:40:47+00:00 69.0 xgb-tuner-27-00-32-14 0.683863\n", "68 0.544451 0.000904 4.0 0.000078 xgb-tuner-27-00-32-14-032-c6499eb4 Completed 2023-01-27 00:37:02+00:00 2023-01-27 00:37:40+00:00 38.0 xgb-tuner-27-00-32-14 0.678610\n", "71 0.643851 0.000088 4.0 0.000003 xgb-tuner-27-00-32-14-029-43d2f514 Completed 2023-01-27 00:36:36+00:00 2023-01-27 00:37:08+00:00 32.0 xgb-tuner-27-00-32-14 0.677474\n", "86 0.648377 0.000098 4.0 0.000018 xgb-tuner-27-00-32-14-014-7c090c65 Completed 2023-01-27 00:35:18+00:00 2023-01-27 00:35:50+00:00 32.0 xgb-tuner-27-00-32-14 0.669266\n", "81 0.718554 0.054962 4.0 0.000013 xgb-tuner-27-00-32-14-019-b6885f1d Completed 2023-01-27 00:35:35+00:00 2023-01-27 00:36:12+00:00 37.0 xgb-tuner-27-00-32-14 0.667242\n", "35 0.524844 0.066738 3.0 0.000008 xgb-tuner-27-00-32-14-065-f1f06d11 Completed 2023-01-27 00:40:21+00:00 2023-01-27 00:40:58+00:00 37.0 xgb-tuner-27-00-32-14 0.665081\n", "87 0.611708 0.000091 4.0 0.004199 xgb-tuner-27-00-32-14-013-e0d8f46a Completed 2023-01-27 00:35:16+00:00 2023-01-27 00:35:53+00:00 37.0 xgb-tuner-27-00-32-14 0.663159\n", "44 0.730583 0.002627 4.0 0.000487 xgb-tuner-27-00-32-14-056-250a354c Completed 2023-01-27 00:39:24+00:00 2023-01-27 00:40:01+00:00 37.0 xgb-tuner-27-00-32-14 0.659589\n", "88 0.474216 0.000016 3.0 0.000453 xgb-tuner-27-00-32-14-012-33b1a582 Completed 2023-01-27 00:35:14+00:00 2023-01-27 00:35:51+00:00 37.0 xgb-tuner-27-00-32-14 0.658692\n", "40 0.498717 0.000230 5.0 0.000013 xgb-tuner-27-00-32-14-060-c903e078 Completed 2023-01-27 00:39:45+00:00 2023-01-27 00:40:22+00:00 37.0 xgb-tuner-27-00-32-14 0.657385" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#!pip install altair\n", "from amtviz import visualize_tuning_job\n", "visualize_tuning_job(tuner, trials_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean-up\n", "\n", "Remove resources you no longer need to avoid ongoing AWS charges." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# CLEAN UP - running this cell deletes the endpoint; skip it if you still need the endpoint\n", "\n", "# Create a low-level SageMaker service client in the session's region.\n", "sagemaker_client = boto3.client('sagemaker', region_name=region)\n", "\n", "# Delete endpoint\n", "sagemaker_client.delete_endpoint(EndpointName=endpoint_name)\n", "\n", "# Clean up temporary files in the S3 bucket. Once these files are removed, the Pareto frontier chart can no longer be re-created.\n", "\n", "import boto3\n", "s3_client = boto3.client('s3')\n", "session = Session()\n", "s3_bucket = session.default_bucket()\n", "s3_bucket_prefix = \"sagemaker/sagemaker-amt-credit-risk-model/data/output/tmp/\"\n", "response = s3_client.list_objects_v2(Bucket=s3_bucket, Prefix=s3_bucket_prefix)\n", "# 'Contents' is absent when no objects match the prefix\n", "for obj in response.get('Contents', []):\n", "    s3_client.delete_object(Bucket=s3_bucket, Key=obj['Key'])" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }