{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Responsible AI - SageMaker Clarify\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Amazon SageMaker Clarify](https://aws.amazon.com/sagemaker/clarify/) helps improve your machine learning models by detecting potential bias and helping explain how these models make predictions. The fairness and explainability functionality provided by SageMaker Clarify takes a step towards enabling AWS customers to build trustworthy and understandable machine learning models. \n", "\n", "In this notebook, we highlight how you can use SageMaker to train models, host them on an inference endpoint, and use bias detection and explainability to analyze data and understand prediction outcomes from the model.\n", "\n", "\n", "__Dataset:__ \n", "The dataset we will use for this exercise comes from [folktables](https://github.com/zykls/folktables). Folktables provides code to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n", "\n", "__ML Problem:__ \n", "Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n", "\n", "\n", "1. 
Read the dataset\n", "2. Data Processing\n", " * Exploratory Data Analysis\n", " * Select features to build the model\n", " * Feature Transformation\n", " * Train - Validation - Test Datasets\n", " * Data processing with Pipeline and ColumnTransformer\n", "3. Train (and Tune) a Classifier\n", "4. Amazon SageMaker Clarify" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes availability of the SageMaker kernel `conda_python3`. In addition, install folktables and scikit-learn:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!pip install -U -q scikit-learn==1.1.3\n", "!pip install -U -q --no-deps folktables==0.0.11" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Reshaping/basic libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Plotting libraries\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n", "\n", "# ML libraries\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# Operational libraries\n", "import sys\n", "from io import StringIO\n", "import os\n", "import time\n", "import IPython\n", "from time import gmtime, strftime\n", "from datetime import datetime, timedelta\n", "\n", "sys.path.append(\"..\")\n", "\n", "# Fairness libraries\n", "from folktables.acs import *\n", "from folktables.folktables import *\n", "from folktables.load_acs import *\n", "\n", "# Jupyter(lab) libraries\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "\n", "# SageMaker and 
connection libraries\n", "import boto3\n", "import urllib\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker import Session\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.s3 import S3Downloader, S3Uploader\n", "from sagemaker import clarify\n", "from sagemaker import model_monitor\n", "from datetime import date\n", "\n", "today = date.today()\n", "bucket = sagemaker.Session().default_bucket()\n", "prfx = \"sagemaker/sagemaker-clarify-income-model\"\n", "region = boto3.Session().region_name\n", "client = boto3.client('sagemaker')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Read the dataset\n", "(Go to top)\n", "\n", "To read in the dataset, we will use [folktables](https://github.com/zykls/folktables), which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type.\n", "\n", "The US Census dataset distinguishes between households and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey=person`. The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`; e.g., `AGEP` refers to the age of an individual." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "income_features = [\n", " \"AGEP\", # age individual\n", " \"COW\", # class of worker\n", " \"SCHL\", # educational attainment\n", " \"MAR\", # marital status\n", " \"OCCP\", # occupation\n", " \"POBP\", # place of birth\n", " \"RELP\", # relationship\n", " \"WKHP\", # hours worked per week past 12 months\n", " \"SEX\", # sex\n", " \"RAC1P\", # recorded detailed race code\n", " \"PWGTP\", # person's weight\n", " \"GCL\", # grandparents living with grandchildren\n", " \"SCH\", # school enrollment\n", "]\n", "\n", "# Define the prediction problem and features\n", "ACSIncome = folktables.BasicProblem(\n", " features=income_features,\n", " target=\"PINCP\", # total person's income\n", " target_transform=lambda x: x > 50000,\n", " group=\"RAC1P\",\n", " preprocess=adult_filter, # applies the following conditions; ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))\n", " postprocess=lambda x: x, # applies post processing, e.g. fill all NAs\n", ")\n", "\n", "# Initialize year, duration (\"1-Year\" or \"5-Year\") and granularity (household or person)\n", "data_source = ACSDataSource(survey_year=\"2018\", horizon=\"1-Year\", survey=\"person\")\n", "# Specify region (here: California) and load data\n", "ca_data = data_source.get_data(states=[\"CA\"], download=True)\n", "# Apply transformation as per problem statement above\n", "ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)\n", "\n", "# Convert numpy array to dataframe\n", "df = pd.DataFrame(\n", " np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),\n", " columns=income_features + [\">50k\"],\n", ")\n", "\n", "# For further modelling we want to use only 2 groups (see DATAPREP notebook for details)\n", "df = df[df[\"RAC1P\"].isin([6, 8])].copy(deep=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Data Processing\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Exploratory Data Analysis\n", "(Go to Data Processing)\n", "\n", "We look at number of rows, columns, and some simple statistics of the dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>AGEP</th><th>COW</th><th>SCHL</th><th>MAR</th><th>OCCP</th><th>POBP</th><th>RELP</th><th>WKHP</th><th>SEX</th><th>RAC1P</th><th>PWGTP</th><th>GCL</th><th>SCH</th><th>&gt;50k</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>30.0</td><td>6.0</td><td>14.0</td><td>1.0</td><td>9610.0</td><td>6.0</td><td>16.0</td><td>40.0</td><td>1.0</td><td>8.0</td><td>32.0</td><td>2.0</td><td>1.0</td><td>0.0</td></tr>\n", " <tr><th>27</th><td>23.0</td><td>2.0</td><td>21.0</td><td>5.0</td><td>2545.0</td><td>207.0</td><td>17.0</td><td>20.0</td><td>2.0</td><td>6.0</td><td>35.0</td><td>NaN</td><td>3.0</td><td>0.0</td></tr>\n", " <tr><th>33</th><td>18.0</td><td>1.0</td><td>16.0</td><td>5.0</td><td>9610.0</td><td>6.0</td><td>17.0</td><td>8.0</td><td>2.0</td><td>6.0</td><td>33.0</td><td>NaN</td><td>2.0</td><td>0.0</td></tr>\n", " <tr><th>46</th><td>40.0</td><td>1.0</td><td>15.0</td><td>3.0</td><td>4140.0</td><td>303.0</td><td>16.0</td><td>22.0</td><td>1.0</td><td>8.0</td><td>38.0</td><td>2.0</td><td>1.0</td><td>0.0</td></tr>\n", " <tr><th>49</th><td>18.0</td><td>1.0</td><td>18.0</td><td>5.0</td><td>725.0</td><td>6.0</td><td>17.0</td><td>12.0</td><td>2.0</td><td>6.0</td><td>60.0</td><td>NaN</td><td>2.0</td><td>0.0</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " AGEP COW SCHL MAR OCCP POBP RELP WKHP SEX RAC1P PWGTP GCL \\\n", "0 30.0 6.0 14.0 1.0 9610.0 6.0 16.0 40.0 1.0 8.0 32.0 2.0 \n", "27 23.0 2.0 21.0 5.0 2545.0 207.0 17.0 20.0 2.0 6.0 35.0 NaN \n", "33 18.0 1.0 16.0 5.0 9610.0 6.0 17.0 8.0 2.0 6.0 33.0 NaN \n", "46 40.0 1.0 15.0 3.0 4140.0 303.0 16.0 22.0 1.0 8.0 38.0 2.0 \n", "49 18.0 1.0 18.0 5.0 725.0 6.0 17.0 12.0 2.0 6.0 60.0 NaN \n", "\n", " SCH >50k \n", "0 1.0 0.0 \n", "27 3.0 0.0 \n", "33 2.0 0.0 \n", "46 1.0 0.0 \n", "49 2.0 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print the first five rows\n", "# NaN means missing data\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the dataset is: (55502, 14)\n" ] } ], "source": [ "# Check how many rows and columns we have in the data frame\n", "print(\"The shape of the dataset is:\", df.shape)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 55502 entries, 0 to 195664\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 AGEP 55502 non-null float64\n", " 1 COW 55502 non-null float64\n", " 2 SCHL 55502 non-null float64\n", " 3 MAR 55502 non-null float64\n", " 4 OCCP 55502 non-null float64\n", " 5 POBP 55502 non-null float64\n", " 6 RELP 55502 non-null float64\n", " 7 WKHP 55502 non-null float64\n", " 8 SEX 55502 non-null float64\n", " 9 RAC1P 55502 non-null float64\n", " 10 PWGTP 55502 non-null float64\n", " 11 GCL 41987 non-null float64\n", " 12 SCH 55502 non-null float64\n", " 13 >50k 55502 non-null float64\n", "dtypes: float64(14)\n", "memory usage: 6.4 MB\n" ] } ], "source": [ "# Let's see the data types and non-null values for each column\n", "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "We can clearly see that all columns are numerical (`dtype = float64`). However, when checking the column headers (and information at top of the notebook), we should notice that we are actually dealing with multimodal data. We expect to see a mix of categorical, numerical and potentially even text information.\n", "\n", "Let's cast the features accordingly. We start by creating list for each feature type." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "categorical_features = [\n", " \"COW\",\n", " \"SCHL\",\n", " \"MAR\",\n", " \"OCCP\",\n", " \"POBP\",\n", " \"RELP\",\n", " \"SEX\",\n", " \"GCL\",\n", " \"SCH\",\n", "]\n", "\n", "numerical_features = [\"AGEP\", \"WKHP\", \"PWGTP\"]\n", "\n", "sensitive_feature = [\"RAC1P\"]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# We cast categorical features to `category`\n", "df[categorical_features] = df[categorical_features].astype(\"object\")\n", "\n", "# We cast categorical features to `category`\n", "df[sensitive_feature] = df[sensitive_feature].astype(\"object\")\n", "\n", "# We cast numerical features to `int`\n", "df[numerical_features] = df[numerical_features].astype(\"int\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check with `.info()` again to make sure the changes took effect." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 55502 entries, 0 to 195664\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 AGEP 55502 non-null int64 \n", " 1 COW 55502 non-null object \n", " 2 SCHL 55502 non-null object \n", " 3 MAR 55502 non-null object \n", " 4 OCCP 55502 non-null object \n", " 5 POBP 55502 non-null object \n", " 6 RELP 55502 non-null object \n", " 7 WKHP 55502 non-null int64 \n", " 8 SEX 55502 non-null object \n", " 9 RAC1P 55502 non-null object \n", " 10 PWGTP 55502 non-null int64 \n", " 11 GCL 41987 non-null object \n", " 12 SCH 55502 non-null object \n", " 13 >50k 55502 non-null float64\n", "dtypes: float64(1), int64(3), object(10)\n", "memory usage: 6.4+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good, so we can now separate model features from model target to explore them separately." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model features: ['COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'SEX', 'GCL', 'SCH', 'AGEP', 'WKHP', 'PWGTP', 'RAC1P']\n", "Model target: >50k\n" ] } ], "source": [ "model_target = \">50k\"\n", "model_features = categorical_features + numerical_features + sensitive_feature\n", "\n", "print(\"Model features: \", model_features)\n", "print(\"Model target: \", model_target)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Double check that the target is not accidentally part of the features\n", "model_target in model_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All good here. 
We made sure that the target is not in the feature list. If the above statement returns `True`, we need to remove the target by calling `model_features.remove(model_target)`.\n", "\n", "Let's have a look at missing values next.\n", "\n", "\n", "#### Missing values\n", "The quickest way to check for missing values is to use `.isna().sum()`. This will provide a count of how many missing values we have. We can also infer the count of missing values from `.info()`, as it provides a count of non-null values." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AGEP 0\n", "COW 0\n", "SCHL 0\n", "MAR 0\n", "OCCP 0\n", "POBP 0\n", "RELP 0\n", "WKHP 0\n", "SEX 0\n", "RAC1P 0\n", "PWGTP 0\n", "GCL 13515\n", "SCH 0\n", ">50k 0\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show missing values\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No missing values anywhere apart from GCL - we might need to drop this column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Target distribution\n", "\n", "Let's check our target distribution." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df[model_target].value_counts().plot.bar(color=\"black\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We notice that we are dealing with an imbalanced dataset. This means there are more examples of one class than the other (here: 0; meaning individuals earning $\\leq$ 50k). This is relevant for model choice and potential up-sampling or down-sampling to balance out the classes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Feature distribution(s)\n", "\n", "Let's now plot the distributions of the shortlisted features of our dataset. We use Seaborn's `pairplot`, which shows the distribution of each feature and the pairwise relationships between features." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.pairplot(\n", " df[numerical_features + [\"RAC1P\"]].sample(1000),\n", " hue=\"RAC1P\",\n", " palette=sns.color_palette(\"husl\", 2),\n", ") # this will plot the numerical features and separate by sensitive attribute\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Select features to build the model\n", "(Go to Data Processing)\n", "\n", "During the extended EDA in the DATAPREP notebook, we learned that `GCL` is a feature that is equally present for both outcome types and also contains a lot of missing values. Therefore, we can drop it from the list of features we want to use for model build. We also drop `OCCP` and `POBP` as those features have too many unique categories." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "to_remove = [\"GCL\", \"OCCP\", \"POBP\"]\n", "\n", "# Drop to_remove features from the respective list(s) - if applicable\n", "for ft in to_remove:\n", " if ft in model_features:\n", " model_features.remove(ft)\n", " if ft in categorical_features:\n", " categorical_features.remove(ft)\n", " if ft in numerical_features:\n", " numerical_features.remove(ft)\n", " if ft in sensitive_feature:\n", " sensitive_feature.remove(ft)\n", " \n", "# Let's also clean up the dataframe and only keep the features and columns we need\n", "df = df[model_features + [model_target]].copy(deep=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Feature transformation\n", "(Go to Data Processing)\n", "\n", "In this notebook, we won't perform any transformation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Train - Validation - Test Datasets\n", "(Go to Data Processing)\n", "\n", "To get training, validation and test sets, we will use sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function twice: first to hold out 10% of the data for testing, then to hold out 15% of the remainder for validation. Because no transformation was applied in the previous step, the feature types we cast earlier are still in place and no re-casting is needed." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train - Test - Validation datasets shapes: (42458, 11) (5551, 11) (7493, 11)\n" ] } ], "source": [ "train_data, test_data = train_test_split(\n", " df, test_size=0.1, shuffle=True, random_state=23\n", ")\n", "\n", "train_data, val_data = train_test_split(\n", " train_data, test_size=0.15, shuffle=True, random_state=23\n", ")\n", "\n", "# Print the shapes of the Train - Test - Validation datasets\n", "print(\n", " \"Train - Test - Validation datasets shapes: \",\n", " train_data.shape,\n", " test_data.shape,\n", " val_data.shape,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Data processing with Pipeline and ColumnTransformer\n", "(Go to Data Processing)\n", "\n", "Let's build a full model pipeline. We need pre-processing split per data type, and then combine everything back into a composite pipeline. 
To achieve this, we will use sklearn's `Pipeline` and `ColumnTransformer`.\n", "\n", "__Step 1 (set up pre-processing per data type):__\n", "> For the numerical features pipeline, the __numerical_processor__ below, we impute missing values with the mean using sklearn's `SimpleImputer`, followed by a `MinMaxScaler` (scaling is not required for tree-based models, but it is useful to see how to chain data transforms). If different processing is desired for different numerical features, a separate pipeline should be built for each group of features.\n", "\n", " > In the categorical features pipeline, the __categorical_processor__ below, we impute with a placeholder value and encode with sklearn's `OneHotEncoder`. If computing memory is an issue, it is a good idea to check the categoricals' unique values to get an estimate of how many dummy features will be created by one-hot encoding. Note the __handle_unknown__ parameter, which tells the encoder to ignore (rather than throw an error for) any unique value that shows up in the validation and/or test set but was not present in the initial training set.\n", " \n", "__Step 2 (combining pre-processing methods into a transformer):__ \n", " > The selective preparations of the dataset features are then put together into a collective `ColumnTransformer`, to be finally used in a Pipeline along with an estimator. 
This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future.\n", " " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "### STEP 1 ###\n", "##############\n", "\n", "# Preprocess the numerical features\n", "numerical_processor = Pipeline(\n", " [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n", ")\n", "# Preprocess the categorical features\n", "categorical_processor = Pipeline(\n", " [\n", " (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n", " (\n", " \"cat_encoder\",\n", " OneHotEncoder(sparse=False, drop=\"if_binary\", handle_unknown=\"ignore\"),\n", " ),\n", " ]\n", ")\n", "\n", "### STEP 2 ###\n", "##############\n", "\n", "# Combine all data preprocessors from above\n", "data_processor = ColumnTransformer(\n", " [\n", " (\"numerical_processing\", numerical_processor, numerical_features),\n", " (\"categorical_processing\", categorical_processor, categorical_features),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Let's take a smaller sample of all datasets to speed up the next steps\n", "train_data = train_data.sample(frac=0.2, random_state=1)\n", "val_data = val_data.sample(frac=0.2, random_state=1)\n", "test_data = test_data.sample(frac=0.2, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We learn and apply the transformation. This will create new dataframes that have many more columns (due to the one hot encoding) and will contain normalized numerical features. 
Be careful when using transformations with models that are deployed (rather than used once): incoming data must pass through the same fitted transformation so that it has the same shape as the data used to train the model." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "transf_train = data_processor.fit_transform(train_data)\n", "transf_val = data_processor.transform(val_data)\n", "transf_test = data_processor.transform(test_data)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Get the column names for categorical features\n", "cat_ft_names = list(\n", " data_processor.transformers_[1][1]\n", " .named_steps[\"cat_encoder\"]\n", " .get_feature_names_out(categorical_features)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have the different data splits, let's upload them to S3. Clarify requires all data to be stored in S3 and we will also need to use a built-in SageMaker model." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Assign file names\n", "train_file, val_file, test_file, = (\n", " \"train.csv\",\n", " \"val.csv\",\n", " \"test.csv\",\n", ")\n", "\n", "# Create dataframes (preprocessed data)\n", "raw_train = pd.DataFrame(transf_train)\n", "raw_val = pd.DataFrame(transf_val)\n", "raw_test = pd.DataFrame(transf_test)\n", "\n", "# Join with model target and sensitive column and save to CSV\n", "pd.concat(\n", " [\n", " train_data[model_target].astype(int).reset_index(drop=True),\n", " train_data[sensitive_feature].astype(object).reset_index(drop=True),\n", " raw_train,\n", " ],\n", " axis=1,\n", ").to_csv(train_file, index=False, header=False, sep=\",\")\n", "\n", "# Repeat for validation data\n", "pd.concat(\n", " [\n", " val_data[model_target].astype(int).reset_index(drop=True),\n", " val_data[sensitive_feature].astype(object).reset_index(drop=True),\n", " raw_val,\n", " ],\n", " axis=1,\n", ").to_csv(val_file, index=False, header=False, sep=\",\")\n", "\n", "# Repeat for test data\n", "pd.concat(\n", " [\n", " test_data[model_target].astype(int).reset_index(drop=True),\n", " test_data[sensitive_feature].astype(object).reset_index(drop=True),\n", " raw_test,\n", " ],\n", " axis=1,\n", ").to_csv(test_file, index=False, header=False, sep=\",\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((8492, 62), (1110, 62), (1499, 62))" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that all sets have same number of columns\n", "raw_train.shape, raw_test.shape, raw_val.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6 Upload Data\n", "(Go to Data Processing)\n", "\n", "Here we upload the data to S3." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "train_uri = S3Uploader.upload(train_file, \"s3://{}/{}/data/train\".format(bucket, prfx))\n", "val_uri = S3Uploader.upload(val_file, \"s3://{}/{}/data/val\".format(bucket, prfx))\n", "test_uri = S3Uploader.upload(test_file, \"s3://{}/{}/data/test\".format(bucket, prfx))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Delete the CSVs from the local instance\n", "os.remove(test_file)\n", "os.remove(train_file)\n", "os.remove(val_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Train a Classifier\n", "(Go to top)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up XGBoost Estimator\n", "\n", "Next, let us set up an XGBoost estimator from the SageMaker built-in algorithms.\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Set an output path where the trained model will be saved\n", "output_path = \"s3://{}/{}/output\".format(bucket, prfx)\n", "\n", "# Set required XGBoost hyperparameters\n", "hyperparams = {\n", " \"max_depth\": \"5\",\n", " \"subsample\": \"0.7\",\n", " \"verbosity\": \"1\",\n", " \"objective\": \"binary:logistic\",\n", " \"num_round\": \"10\",\n", "}\n", "\n", "# Retrieve the XGBoost container image URI\n", "container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"latest\")\n", "\n", "# Specify name for training job\n", "job_name = \"clarify-model-xgb-training\"\n", "\n", "# Create the XGBoost estimator object\n", "XGBoost_estimator = sagemaker.estimator.Estimator(\n", " container,\n", " get_execution_role(),\n", " hyperparameters=hyperparams,\n", " disable_profiler=True,\n", " instance_count=1,\n", " base_job_name=job_name,\n", " instance_type=\"ml.m5.2xlarge\",\n", " output_path=output_path,\n", " sagemaker_session=sagemaker.Session(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Specify the inputs for the model and fit." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker:Creating training-job with name: clarify-model-xgb-training-2023-04-26-05-34-54-986\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "2023-04-26 05:34:57 Starting - Starting the training job..\n", "2023-04-26 05:35:13 Starting - Preparing the instances for training........\n", "2023-04-26 05:35:57 Downloading - Downloading input data....\n", "2023-04-26 05:36:22 Training - Downloading the training image...\n", "2023-04-26 05:36:43 Training - Training image download completed. Training in progress.....\n", "2023-04-26 05:37:09 Uploading - Uploading generated training model.\n", "2023-04-26 05:37:20 Completed - Training job completed\n" ] } ], "source": [ "# Specify the training data\n", "train_input = TrainingInput(train_uri, content_type=\"csv\")\n", "\n", "# Specify the validation data - not necessary, but can be used if tuning desired\n", "validation_input = TrainingInput(val_uri, content_type=\"csv\")\n", "\n", "XGBoost_estimator.fit(\n", " {\"train\": train_input}, logs=False\n", ") # if desired \"validation\": validation_input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are almost ready to deploy the model. One final setting we want to specify is `DataCaptureConfig`. This setting will allow us later to check if the data that is submitted to the model endpoint follows a similar distribution as the data that was used for training." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "s3_capture_upload_path = \"s3://{}/{}/datacapture\".format(bucket, prfx)\n", "\n", "data_capture_config = model_monitor.DataCaptureConfig(\n", " enable_capture=True,\n", " sampling_percentage=100,\n", " destination_s3_uri=s3_capture_upload_path,\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker:Creating model with name: clarify-model-xgb\n", "INFO:sagemaker:Creating endpoint-config with name clarify-model-xgb-20230426Apr\n", "INFO:sagemaker:Creating endpoint with name clarify-model-xgb-20230426Apr\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "-------!" ] } ], "source": [ "endpointname = \"clarify-model-xgb-%s\" % today.strftime(\"%Y%m%d%h\")\n", "modelname = \"clarify-model-xgb\"\n", "\n", "XGBoost_predictor = XGBoost_estimator.deploy(\n", " initial_instance_count=1,\n", " instance_type=\"ml.t2.medium\",\n", " serializer=CSVSerializer(),\n", " endpoint_name=endpointname,\n", " model_name=modelname,\n", " data_capture_config=data_capture_config,\n", ") # endpoint_name needs to be unique! 
Use AWS account number as part of name, or at least timestamp" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'0.20109091699123383,0.1581142097711563,0.13745033740997314,0.10564066469669342,0.1926421970129013,0.1260198950767517,0.05225953087210655,0.24344408512115479,0.08050476759672165,0.1260198950767517,0.18747873604297638,0.14705978333950043,0.1926421970129013,0.1325863152742386,0.1926421970129013,0.09346900880336761,0.2527164816856384,0.1926421970129013,0.08050476759672165,0.41217148303985596'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that endpoint works by getting some predictions of the test data\n", "XGBoost_predictor.predict(raw_test.to_numpy()[:20, :])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, create a model in SageMaker (optional). This can be helpful if you want to see the model in the AWS console. Another option at this point is to compile the model (optimize for performance). " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "model = XGBoost_estimator.create_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Amazon SageMaker Clarify\n", "(Go to top)\n", "\n", "Now that you have your model set up, let's say hello to SageMaker Clarify!" 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.\n", "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n" ] } ], "source": [ "clarify_processor = clarify.SageMakerClarifyProcessor(\n", " role=get_execution_role(),\n", " instance_count=1,\n", " instance_type=\"ml.c4.xlarge\",\n", " sagemaker_session=Session(),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `ModelConfig` object communicates information about your trained model. To avoid additional traffic to your production models, SageMaker Clarify sets up and tears down a dedicated endpoint when processing. \n", "The parameters `instance_type` and `instance_count` specify the instance type and number of instances used to host your model during SageMaker Clarify’s processing." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "model_config = clarify.ModelConfig(\n", " model_name=modelname,\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " accept_type=\"text/csv\",\n", " content_type=\"text/csv\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `ModelPredictedLabelConfig` provides information on the format of your predictions. The XGBoost classifier outputs a probability for each sample, so SageMaker Clarify invokes the endpoint and then uses `probability_threshold` to convert the probabilities to binary labels for bias analysis. Predictions above the threshold are interpreted as label value 1, and predictions at or below it as label value 0." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1. 
Bias Reporting\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataConfig` object communicates some basic information about the data to SageMaker Clarify. We specify where to find the input dataset, where to store the output, the target column (label), the header names, and the dataset type." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "bias_report_output_path = \"s3://{}/{}/clarify-bias\".format(bucket, prfx)\n", "\n", "bias_data_config = clarify.DataConfig(\n", "    s3_data_input_path=train_uri,\n", "    s3_output_path=bias_report_output_path,\n", "    label=model_target,\n", "    headers=[model_target] + sensitive_feature + numerical_features + cat_ft_names,\n", "    dataset_type=\"text/csv\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker Clarify also needs to know which column is sensitive (the facet, `facet_name`), which values of that column identify the sensitive group (`facet_values_or_threshold`), and which label values count as the desirable outcome (`label_values_or_threshold`). Here `facet_name = \"RAC1P\"` and we want to look for bias against group 6. The desired outcome is label value 1, that is, earnings above 50k. Note that `RAC1P` is stored as a float column, so SageMaker Clarify may classify it as continuous and read `[6]` as a threshold rather than as a single category; check the `value_or_threshold` entry in the job output to see which group was actually analyzed. If you want to analyze more than one facet, provide a list; e.g., `facet_name = [\"RAC1P\", \"AGEP\"]`. For details, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html)." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# Set up bias config\n", "bias_config = clarify.BiasConfig(\n", "    label_values_or_threshold=[1],\n", "    facet_name=\"RAC1P\",\n", "    facet_values_or_threshold=[6],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are ready to run the bias report. This type of report expects a trained model. 
There is also the option to evaluate for pre-training bias using:\n", "```\n", "clarify_processor.run_pre_training_bias(\n", " data_config=bias_data_config,\n", " data_bias_config=bias_config,\n", " methods=[\"CI\", \"DPL\"],\n", ")\n", "```\n", "The pre-training bias can be evaluated as soon as the data is uploaded to S3." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['>50k', 'RAC1P', 'AGEP', 'WKHP', 'PWGTP', 'COW_1.0', 'COW_2.0', 'COW_3.0', 'COW_4.0', 'COW_5.0', 'COW_6.0', 'COW_7.0', 'COW_8.0', 'SCHL_1.0', 'SCHL_2.0', 'SCHL_3.0', 'SCHL_4.0', 'SCHL_5.0', 'SCHL_6.0', 'SCHL_7.0', 'SCHL_8.0', 'SCHL_9.0', 'SCHL_10.0', 'SCHL_11.0', 'SCHL_12.0', 'SCHL_13.0', 'SCHL_14.0', 'SCHL_15.0', 'SCHL_16.0', 'SCHL_17.0', 'SCHL_18.0', 'SCHL_19.0', 'SCHL_20.0', 'SCHL_21.0', 'SCHL_22.0', 'SCHL_23.0', 'SCHL_24.0', 'MAR_1.0', 'MAR_2.0', 'MAR_3.0', 'MAR_4.0', 'MAR_5.0', 'RELP_0.0', 'RELP_1.0', 'RELP_2.0', 'RELP_3.0', 'RELP_4.0', 'RELP_5.0', 'RELP_6.0', 'RELP_7.0', 'RELP_8.0', 'RELP_9.0', 'RELP_10.0', 'RELP_11.0', 'RELP_12.0', 'RELP_13.0', 'RELP_14.0', 'RELP_15.0', 'RELP_16.0', 'RELP_17.0', 'SEX_2.0', 'SCH_1.0', 'SCH_2.0', 'SCH_3.0'], 'label': '>50k', 'label_values_or_threshold': [1], 'facet': [{'name_or_index': 'RAC1P', 'value_or_threshold': [6]}], 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'pre_training_bias': {'methods': 'all'}, 'post_training_bias': {'methods': 'all'}}, 'predictor': {'model_name': 'clarify-model-xgb', 'instance_type': 'ml.m5.xlarge', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'probability_threshold': 0.6}\n", "INFO:sagemaker:Creating processing-job with name Clarify-Bias-2023-04-26-05-41-25-016\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ ".......................................\u001b[34m2023-04-26 05:47:56,947 
logging.conf not found when configuring logging, using default logging configuration.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,947 Starting SageMaker Clarify Processing job\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,948 Analysis config path: /opt/ml/processing/input/config/analysis_config.json\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,948 Analysis result path: /opt/ml/processing/output\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,949 This host is algo-1.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,949 This host is the leader.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:56,949 Number of hosts in the cluster is 1.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,300 Running Python / Pandas based analyzer.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,300 Dataset type: text/csv uri: /opt/ml/processing/input/data\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,312 Loading dataset...\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/analyzer/data_loading/csv_data_loader.py:330: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n", " df = df.append(df_tmp, ignore_index=True)\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,371 Loaded dataset. 
Dataset info:\u001b[0m\n", "\u001b[34m\u001b[0m\n", "\u001b[34mRangeIndex: 8492 entries, 0 to 8491\u001b[0m\n", "\u001b[34mData columns (total 63 columns):\n", " # Column Non-Null Count Dtype \u001b[0m\n", "\u001b[34m--- ------ -------------- ----- \n", " 0 RAC1P 8492 non-null float64\n", " 1 AGEP 8492 non-null float64\n", " 2 WKHP 8492 non-null float64\n", " 3 PWGTP 8492 non-null float64\n", " 4 COW_1.0 8492 non-null float64\n", " 5 COW_2.0 8492 non-null float64\n", " 6 COW_3.0 8492 non-null float64\n", " 7 COW_4.0 8492 non-null float64\n", " 8 COW_5.0 8492 non-null float64\n", " 9 COW_6.0 8492 non-null float64\n", " 10 COW_7.0 8492 non-null float64\n", " 11 COW_8.0 8492 non-null float64\n", " 12 SCHL_1.0 8492 non-null float64\n", " 13 SCHL_2.0 8492 non-null float64\n", " 14 SCHL_3.0 8492 non-null float64\n", " 15 SCHL_4.0 8492 non-null float64\n", " 16 SCHL_5.0 8492 non-null float64\n", " 17 SCHL_6.0 8492 non-null float64\n", " 18 SCHL_7.0 8492 non-null float64\n", " 19 SCHL_8.0 8492 non-null float64\n", " 20 SCHL_9.0 8492 non-null float64\n", " 21 SCHL_10.0 8492 non-null float64\n", " 22 SCHL_11.0 8492 non-null float64\n", " 23 SCHL_12.0 8492 non-null float64\n", " 24 SCHL_13.0 8492 non-null float64\n", " 25 SCHL_14.0 8492 non-null float64\n", " 26 SCHL_15.0 8492 non-null float64\n", " 27 SCHL_16.0 8492 non-null float64\n", " 28 SCHL_17.0 8492 non-null float64\n", " 29 SCHL_18.0 8492 non-null float64\n", " 30 SCHL_19.0 8492 non-null float64\n", " 31 SCHL_20.0 8492 non-null float64\n", " 32 SCHL_21.0 8492 non-null float64\n", " 33 SCHL_22.0 8492 non-null float64\n", " 34 SCHL_23.0 8492 non-null float64\n", " 35 SCHL_24.0 8492 non-null float64\n", " 36 MAR_1.0 8492 non-null float64\n", " 37 MAR_2.0 8492 non-null float64\n", " 38 MAR_3.0 8492 non-null float64\n", " 39 MAR_4.0 8492 non-null float64\n", " 40 MAR_5.0 8492 non-null float64\n", " 41 RELP_0.0 8492 non-null float64\n", " 42 RELP_1.0 8492 non-null float64\n", " 43 RELP_2.0 8492 non-null float64\n", " 44 
RELP_3.0 8492 non-null float64\n", " 45 RELP_4.0 8492 non-null float64\n", " 46 RELP_5.0 8492 non-null float64\n", " 47 RELP_6.0 8492 non-null float64\n", " 48 RELP_7.0 8492 non-null float64\n", " 49 RELP_8.0 8492 non-null float64\n", " 50 RELP_9.0 8492 non-null float64\n", " 51 RELP_10.0 8492 non-null float64\n", " 52 RELP_11.0 8492 non-null float64\n", " 53 RELP_12.0 8492 non-null float64\n", " 54 RELP_13.0 8492 non-null float64\n", " 55 RELP_14.0 8492 non-null float64\n", " 56 RELP_15.0 8492 non-null float64\n", " 57 RELP_16.0 8492 non-null float64\n", " 58 RELP_17.0 8492 non-null float64\n", " 59 SEX_2.0 8492 non-null float64\n", " 60 SCH_1.0 8492 non-null float64\n", " 61 SCH_2.0 8492 non-null float64\n", " 62 SCH_3.0 8492 non-null float64\u001b[0m\n", "\u001b[34mdtypes: float64(63)\u001b[0m\n", "\u001b[34mmemory usage: 4.1 MB\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,456 Spinning up shadow endpoint\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,456 Creating endpoint-config with name sm-clarify-config-1682488077-3c99\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,748 Creating endpoint: 'sm-clarify-clarify-model-xgb-1682488077-118d'\u001b[0m\n", "\u001b[34m2023-04-26 05:47:57,803 No endpoints ruleset found for service sagemaker-internal, falling back to legacy endpoint routing.\u001b[0m\n", "\u001b[34m2023-04-26 05:47:58,261 Using endpoint name: sm-clarify-clarify-model-xgb-1682488077-118d\u001b[0m\n", "\u001b[34m2023-04-26 05:47:58,261 Waiting for endpoint ...\u001b[0m\n", "\u001b[34m2023-04-26 05:47:58,261 Checking endpoint status:\u001b[0m\n", "\u001b[34mLegend:\u001b[0m\n", "\u001b[34m(OutOfService: x, Creating: -, Updating: -, InService: !, RollingBack: <, Deleting: o, Failed: *)\u001b[0m\n", "\u001b[34m2023-04-26 05:50:58,834 Endpoint is in service after 181 seconds\u001b[0m\n", "\u001b[34m2023-04-26 05:50:58,835 Endpoint ready.\u001b[0m\n", "\u001b[34m2023-04-26 05:50:58,835 ======================================\u001b[0m\n", "\u001b[34m2023-04-26 
05:50:58,835 Calculating post-training bias metrics\u001b[0m\n", "\u001b[34m2023-04-26 05:50:58,835 ======================================\u001b[0m\n", "\u001b[34m2023-04-26 05:50:58,835 Getting predictions from the endpoint\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,920 We assume a prediction above 0.600 indicates 1 and below or equal indicates 0.\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,921 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,923 Column RAC1P with data uniqueness fraction 0.00023551577955723034 is classifed as a CONTINUOUS column\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/smclarify/bias/report.py:591: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " df = df.drop(facet_column.name, 1)\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,925 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,927 Column None with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:50:59,928 Threshold Interval indices: IntervalIndex([(6.0, 8.0]], dtype='interval[float64, right]')\u001b[0m\n", "\u001b[34m2023-04-26 05:51:00,027 CDDPL metrics failed\u001b[0m\n", "\u001b[34mTraceback (most recent call last):\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/report.py\", line 357, in _continuous_metric_call_wrapper\n", " metric_value = smclarify.bias.metrics.call_metric(\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/__init__.py\", line 27, in call_metric\n", " return metric(**{key: kwargs[key] for key in inspect.signature(metric).parameters.keys()})\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/posttraining.py\", line 277, in CDDPL\n", " 
return common.CDD(feature, sensitive_facet_index, positive_predicted_label_index, group_variable)\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/common.py\", line 100, in CDD\n", " raise ValueError(\"Group variable is empty or not provided\")\u001b[0m\n", "\u001b[34mValueError: Group variable is empty or not provided\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,007 Calculated global analysis with predictor\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,007 Stop using endpoint: sm-clarify-clarify-model-xgb-1682488077-118d\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,008 Deleting endpoint configuration with name: sm-clarify-config-1682488077-3c99\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,167 Deleting endpoint with name: sm-clarify-clarify-model-xgb-1682488077-118d\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,342 Model endpoint delivered 0.92049 requests per second and a total of 2 requests over 2 seconds\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,343 =====================================\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,343 Calculating pre-training bias metrics\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,343 =====================================\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,343 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,346 Column RAC1P with data uniqueness fraction 0.00023551577955723034 is classifed as a CONTINUOUS column\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/smclarify/bias/report.py:591: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " df = df.drop(facet_column.name, 1)\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,349 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,351 Threshold Interval indices: 
IntervalIndex([(6.0, 8.0]], dtype='interval[float64, right]')\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,399 CDDL metrics failed\u001b[0m\n", "\u001b[34mTraceback (most recent call last):\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/report.py\", line 357, in _continuous_metric_call_wrapper\n", " metric_value = smclarify.bias.metrics.call_metric(\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/__init__.py\", line 27, in call_metric\n", " return metric(**{key: kwargs[key] for key in inspect.signature(metric).parameters.keys()})\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/pretraining.py\", line 195, in CDDL\n", " return common.CDD(feature, sensitive_facet_index, positive_label_index, group_variable)\n", " File \"/usr/local/lib/python3.9/site-packages/smclarify/bias/metrics/common.py\", line 100, in CDD\n", " raise ValueError(\"Group variable is empty or not provided\")\u001b[0m\n", "\u001b[34mValueError: Group variable is empty or not provided\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,775 ======================================\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,776 Calculating bias statistics for report\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,776 ======================================\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,776 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,778 Column RAC1P with data uniqueness fraction 0.00023551577955723034 is classifed as a CONTINUOUS column\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/smclarify/bias/report.py:591: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only.\n", " df = df.drop(facet_column.name, 1)\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,780 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a 
CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,781 Column None with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,783 Threshold Interval indices: IntervalIndex([(6.0, 8.0]], dtype='interval[float64, right]')\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,931 Column >50k with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,933 Column None with data uniqueness fraction 0.00023551577955723034 is classifed as a CATEGORICAL column\u001b[0m\n", "\u001b[34m2023-04-26 05:51:01,942 Converting Pandas DataFrame to SparkDataFrame for computing report metadata\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m--!05:51:04.519 [main] WARN o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\u001b[0m\n", "\u001b[34m05:51:11.335 [Thread-4] WARN o.a.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\u001b[0m\n", "\u001b[34m05:51:12.722 [dispatcher-event-loop-2] WARN o.a.spark.scheduler.TaskSetManager - Stage 0 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:14.644 [dispatcher-event-loop-2] WARN o.a.spark.scheduler.TaskSetManager - Stage 3 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:15.538 [dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 9 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:15.814 [dispatcher-event-loop-2] WARN o.a.spark.scheduler.TaskSetManager - Stage 12 contains a task of very large size (1152 KiB). 
The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:16.218 [dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 13 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:16.554 [dispatcher-event-loop-3] WARN o.a.spark.scheduler.TaskSetManager - Stage 19 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:16.787 [dispatcher-event-loop-3] WARN o.a.spark.scheduler.TaskSetManager - Stage 22 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:17.090 [dispatcher-event-loop-0] WARN o.a.spark.scheduler.TaskSetManager - Stage 23 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:17.614 [dispatcher-event-loop-0] WARN o.a.spark.scheduler.TaskSetManager - Stage 29 contains a task of very large size (1152 KiB). The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m05:51:17.838 [dispatcher-event-loop-1] WARN o.a.spark.scheduler.TaskSetManager - Stage 32 contains a task of very large size (1152 KiB). 
The maximum recommended task size is 1000 KiB.\u001b[0m\n", "\u001b[34m#015[Stage 0:> (0 + 4) / 4]#015#015[Stage 0:==============> (1 + 3) / 4]#015#015 #0152023-04-26 05:51:17,993 Calculated global analysis without predictor\u001b[0m\n", "\u001b[34m2023-04-26 05:51:17,993 Stop using endpoint: None\u001b[0m\n", "\u001b[34m2023-04-26 05:51:18,948 ['jupyter', 'nbconvert', '--to', 'html', '--output', '/opt/ml/processing/output/report.html', '/opt/ml/processing/output/report.ipynb', '--template', 'sagemaker-xai']\u001b[0m\n", "\u001b[34m[NbConvertApp] Converting notebook /opt/ml/processing/output/report.ipynb to html\u001b[0m\n", "\u001b[34m[NbConvertApp] Writing 750340 bytes to /opt/ml/processing/output/report.html\u001b[0m\n", "\u001b[34m2023-04-26 05:51:20,022 ['wkhtmltopdf', '-q', '--enable-local-file-access', '/opt/ml/processing/output/report.html', '/opt/ml/processing/output/report.pdf']\u001b[0m\n", "\u001b[34m2023-04-26 05:51:21,397 Collected analyses: \u001b[0m\n", "\u001b[34m{\n", " \"version\": \"1.0\",\n", " \"post_training_bias_metrics\": {\n", " \"label\": \">50k\",\n", " \"facets\": {\n", " \"RAC1P\": [\n", " {\n", " \"value_or_threshold\": \"(6.0, 8.0]\",\n", " \"metrics\": [\n", " {\n", " \"name\": \"AD\",\n", " \"description\": \"Accuracy Difference (AD)\",\n", " \"value\": -0.060298268619563\n", " },\n", " {\n", " \"name\": \"CDDPL\",\n", " \"description\": \"Conditional Demographic Disparity in Predicted Labels (CDDPL)\",\n", " \"value\": null,\n", " \"error\": \"Group variable is empty or not provided\"\n", " },\n", " {\n", " \"name\": \"DAR\",\n", " \"description\": \"Difference in Acceptance Rates (DAR)\",\n", " \"value\": 0.03853525992080209\n", " },\n", " {\n", " \"name\": \"DCA\",\n", " \"description\": \"Difference in Conditional Acceptance (DCA)\",\n", " \"value\": -3.465972280731317\n", " },\n", " {\n", " \"name\": \"DCR\",\n", " \"description\": \"Difference in Conditional Rejection (DCR)\",\n", " \"value\": -0.019338711280181053\n", " 
},\n", " {\n", " \"name\": \"DI\",\n", " \"description\": \"Disparate Impact (DI)\",\n", " \"value\": 0.10435167424972934\n", " },\n", " {\n", " \"name\": \"DPPL\",\n", " \"description\": \"Difference in Positive Proportions in Predicted Labels (DPPL)\",\n", " \"value\": 0.3546276018474536\n", " },\n", " {\n", " \"name\": \"DRR\",\n", " \"description\": \"Difference in Rejection Rates (DRR)\",\n", " \"value\": 0.09243844488199526\n", " },\n", " {\n", " \"name\": \"FT\",\n", " \"description\": \"Flip Test (FT)\",\n", " \"value\": -0.09188095926032938\n", " },\n", " {\n", " \"name\": \"GE\",\n", " \"description\": \"Generalized Entropy (GE)\",\n", " \"value\": 0.1227562795261298\n", " },\n", " {\n", " \"name\": \"RD\",\n", " \"description\": \"Recall Difference (RD)\",\n", " \"value\": 0.506130406929783\n", " },\n", " {\n", " \"name\": \"SD\",\n", " \"description\": \"Specificity Difference (SD)\",\n", " \"value\": 0.12959372781998368\n", " },\n", " {\n", " \"name\": \"TE\",\n", " \"description\": \"Treatment Equality (TE)\",\n", " \"value\": 15.291158536585366\n", " }\n", " ]\n", " }\n", " ]\n", " },\n", " \"label_value_or_threshold\": \"1\"\n", " },\n", " \"pre_training_bias_metrics\": {\n", " \"label\": \">50k\",\n", " \"facets\": {\n", " \"RAC1P\": [\n", " {\n", " \"value_or_threshold\": \"(6.0, 8.0]\",\n", " \"metrics\": [\n", " {\n", " \"name\": \"CDDL\",\n", " \"description\": \"Conditional Demographic Disparity in Labels (CDDL)\",\n", " \"value\": null,\n", " \"error\": \"Group variable is empty or not provided\"\n", " },\n", " {\n", " \"name\": \"CI\",\n", " \"description\": \"Class Imbalance (CI)\",\n", " \"value\": 0.18487988695242583\n", " },\n", " {\n", " \"name\": \"DPL\",\n", " \"description\": \"Difference in Positive Proportions in Labels (DPL)\",\n", " \"value\": 0.2867271170692013\n", " },\n", " {\n", " \"name\": \"JS\",\n", " \"description\": \"Jensen-Shannon Divergence (JS)\",\n", " \"value\": 0.047136886083804404\n", " },\n", " {\n", " \"name\": 
\"KL\",\n", " \"description\": \"Kullback-Liebler Divergence (KL)\",\n", " \"value\": 0.2082746966374284\n", " },\n", " {\n", " \"name\": \"KS\",\n", " \"description\": \"Kolmogorov-Smirnov Distance (KS)\",\n", " \"value\": 0.2867271170692013\n", " },\n", " {\n", " \"name\": \"LP\",\n", " \"description\": \"L-p Norm (LP)\",\n", " \"value\": 0.40549337765940263\n", " },\n", " {\n", " \"name\": \"TVD\",\n", " \"description\": \"Total Variation Distance (TVD)\",\n", " \"value\": 0.2867271170692013\n", " }\n", " ]\n", " }\n", " ]\n", " },\n", " \"label_value_or_threshold\": \"1\"\n", " }\u001b[0m\n", "\u001b[34m}\u001b[0m\n", "\u001b[34m2023-04-26 05:51:21,398 exit_message: Completed: SageMaker XAI Analyzer ran successfully\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "clarify_processor.run_bias(\n", " data_config=bias_data_config,\n", " bias_config=bias_config,\n", " model_config=model_config,\n", " model_predicted_label_config=predictions_config,\n", " pre_training_methods=\"all\",\n", " post_training_methods=\"all\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2. Explainability Reporting\n", "(Go to top)\n", "\n", "Let's move on to explainability concepts. We start with a test value and baseline." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "baseline = (\n", " pd.concat(\n", " [\n", " test_data[sensitive_feature].astype(object).reset_index(drop=True),\n", " raw_test,\n", " ],\n", " axis=1,\n", " )\n", " .iloc[0]\n", " .values.tolist()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the bias report, we first need to provide some specifications." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Set up path for where explainability output should be stored\n", "explainability_report_output_path = \"s3://{}/{}/clarify-explainability\".format(\n", " bucket, prfx\n", ")\n", "\n", "# Initialize configuration for SHAP\n", "shap_config = clarify.SHAPConfig(\n", " baseline=[baseline],\n", " num_samples=15,\n", " agg_method=\"mean_abs\",\n", " save_local_shap_values=True,\n", ")\n", "\n", "# Provide details about the data used for the explainability example\n", "explainability_data_config = clarify.DataConfig(\n", " s3_data_input_path=train_uri,\n", " s3_output_path=explainability_report_output_path,\n", " label=model_target,\n", " headers=[model_target] + sensitive_feature + numerical_features + cat_ft_names,\n", " dataset_type=\"text/csv\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run the explainability report." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['>50k', 'RAC1P', 'AGEP', 'WKHP', 'PWGTP', 'COW_1.0', 'COW_2.0', 'COW_3.0', 'COW_4.0', 'COW_5.0', 'COW_6.0', 'COW_7.0', 'COW_8.0', 'SCHL_1.0', 'SCHL_2.0', 'SCHL_3.0', 'SCHL_4.0', 'SCHL_5.0', 'SCHL_6.0', 'SCHL_7.0', 'SCHL_8.0', 'SCHL_9.0', 'SCHL_10.0', 'SCHL_11.0', 'SCHL_12.0', 'SCHL_13.0', 'SCHL_14.0', 'SCHL_15.0', 'SCHL_16.0', 'SCHL_17.0', 'SCHL_18.0', 'SCHL_19.0', 'SCHL_20.0', 'SCHL_21.0', 'SCHL_22.0', 'SCHL_23.0', 'SCHL_24.0', 'MAR_1.0', 'MAR_2.0', 'MAR_3.0', 'MAR_4.0', 'MAR_5.0', 'RELP_0.0', 'RELP_1.0', 'RELP_2.0', 'RELP_3.0', 'RELP_4.0', 'RELP_5.0', 'RELP_6.0', 'RELP_7.0', 'RELP_8.0', 'RELP_9.0', 'RELP_10.0', 'RELP_11.0', 'RELP_12.0', 'RELP_13.0', 'RELP_14.0', 'RELP_15.0', 'RELP_16.0', 'RELP_17.0', 'SEX_2.0', 'SCH_1.0', 'SCH_2.0', 'SCH_3.0'], 'label': '>50k', 'predictor': {'model_name': 
'clarify-model-xgb', 'instance_type': 'ml.m5.xlarge', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'shap': {'use_logit': False, 'save_local_shap_values': True, 'baseline': [[6.0, 0.3766233766233767, 0.39795918367346933, 0.07665260196905767, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]], 'num_samples': 15, 'agg_method': 'mean_abs'}}}\n", "INFO:sagemaker:Creating processing-job with name Clarify-Explainability-2023-04-26-05-52-02-339\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "........................................\u001b[34m2023-04-26 05:58:41,018 logging.conf not found when configuring logging, using default logging configuration.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,019 Starting SageMaker Clarify Processing job\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,024 Analysis config path: /opt/ml/processing/input/config/analysis_config.json\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,024 Analysis result path: /opt/ml/processing/output\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,025 This host is algo-1.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,026 This host is the leader.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,026 Number of hosts in the cluster is 1.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,294 Running Python / Pandas based analyzer.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,294 Dataset type: text/csv uri: /opt/ml/processing/input/data\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,306 Loading dataset...\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/analyzer/data_loading/csv_data_loader.py:330: FutureWarning: The frame.append method is deprecated and will be removed 
from pandas in a future version. Use pandas.concat instead.\n", " df = df.append(df_tmp, ignore_index=True)\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,364 Loaded dataset. Dataset info:\u001b[0m\n", "\u001b[34m\u001b[0m\n", "\u001b[34mRangeIndex: 8492 entries, 0 to 8491\u001b[0m\n", "\u001b[34mData columns (total 63 columns):\n", " # Column Non-Null Count Dtype \u001b[0m\n", "\u001b[34m--- ------ -------------- ----- \n", " 0 RAC1P 8492 non-null float64\n", " 1 AGEP 8492 non-null float64\n", " 2 WKHP 8492 non-null float64\n", " 3 PWGTP 8492 non-null float64\n", " 4 COW_1.0 8492 non-null float64\n", " 5 COW_2.0 8492 non-null float64\n", " 6 COW_3.0 8492 non-null float64\n", " 7 COW_4.0 8492 non-null float64\n", " 8 COW_5.0 8492 non-null float64\n", " 9 COW_6.0 8492 non-null float64\n", " 10 COW_7.0 8492 non-null float64\n", " 11 COW_8.0 8492 non-null float64\n", " 12 SCHL_1.0 8492 non-null float64\n", " 13 SCHL_2.0 8492 non-null float64\n", " 14 SCHL_3.0 8492 non-null float64\n", " 15 SCHL_4.0 8492 non-null float64\n", " 16 SCHL_5.0 8492 non-null float64\n", " 17 SCHL_6.0 8492 non-null float64\n", " 18 SCHL_7.0 8492 non-null float64\n", " 19 SCHL_8.0 8492 non-null float64\n", " 20 SCHL_9.0 8492 non-null float64\n", " 21 SCHL_10.0 8492 non-null float64\n", " 22 SCHL_11.0 8492 non-null float64\n", " 23 SCHL_12.0 8492 non-null float64\n", " 24 SCHL_13.0 8492 non-null float64\n", " 25 SCHL_14.0 8492 non-null float64\n", " 26 SCHL_15.0 8492 non-null float64\n", " 27 SCHL_16.0 8492 non-null float64\n", " 28 SCHL_17.0 8492 non-null float64\n", " 29 SCHL_18.0 8492 non-null float64\n", " 30 SCHL_19.0 8492 non-null float64\n", " 31 SCHL_20.0 8492 non-null float64\n", " 32 SCHL_21.0 8492 non-null float64\n", " 33 SCHL_22.0 8492 non-null float64\n", " 34 SCHL_23.0 8492 non-null float64\n", " 35 SCHL_24.0 8492 non-null float64\n", " 36 MAR_1.0 8492 non-null float64\n", " 37 MAR_2.0 8492 non-null float64\n", " 38 MAR_3.0 8492 non-null float64\n", " 39 MAR_4.0 8492 non-null 
float64\n", " 40 MAR_5.0 8492 non-null float64\n", " 41 RELP_0.0 8492 non-null float64\n", " 42 RELP_1.0 8492 non-null float64\n", " 43 RELP_2.0 8492 non-null float64\n", " 44 RELP_3.0 8492 non-null float64\n", " 45 RELP_4.0 8492 non-null float64\n", " 46 RELP_5.0 8492 non-null float64\n", " 47 RELP_6.0 8492 non-null float64\n", " 48 RELP_7.0 8492 non-null float64\n", " 49 RELP_8.0 8492 non-null float64\n", " 50 RELP_9.0 8492 non-null float64\n", " 51 RELP_10.0 8492 non-null float64\n", " 52 RELP_11.0 8492 non-null float64\n", " 53 RELP_12.0 8492 non-null float64\n", " 54 RELP_13.0 8492 non-null float64\n", " 55 RELP_14.0 8492 non-null float64\n", " 56 RELP_15.0 8492 non-null float64\n", " 57 RELP_16.0 8492 non-null float64\n", " 58 RELP_17.0 8492 non-null float64\n", " 59 SEX_2.0 8492 non-null float64\n", " 60 SCH_1.0 8492 non-null float64\n", " 61 SCH_2.0 8492 non-null float64\n", " 62 SCH_3.0 8492 non-null float64\u001b[0m\n", "\u001b[34mdtypes: float64(63)\u001b[0m\n", "\u001b[34mmemory usage: 4.1 MB\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,454 Spinning up shadow endpoint\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,454 Creating endpoint-config with name sm-clarify-config-1682488721-ed87\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,767 Creating endpoint: 'sm-clarify-clarify-model-xgb-1682488721-365f'\u001b[0m\n", "\u001b[34m2023-04-26 05:58:41,823 No endpoints ruleset found for service sagemaker-internal, falling back to legacy endpoint routing.\u001b[0m\n", "\u001b[34m2023-04-26 05:58:42,284 Using endpoint name: sm-clarify-clarify-model-xgb-1682488721-365f\u001b[0m\n", "\u001b[34m2023-04-26 05:58:42,284 Waiting for endpoint ...\u001b[0m\n", "\u001b[34m2023-04-26 05:58:42,284 Checking endpoint status:\u001b[0m\n", "\u001b[34mLegend:\u001b[0m\n", "\u001b[34m(OutOfService: x, Creating: -, Updating: -, InService: !, RollingBack: <, Deleting: o, Failed: *)\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,834 Endpoint is in service after 181 seconds\u001b[0m\n", 
"\u001b[34m2023-04-26 06:01:42,835 Endpoint ready.\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,840 The number of requested samples 15 is too small given the number of features 63\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,840 SHAP n_samples 15\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,940 =====================================================\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,941 Shap analyzer: explaining 8492 rows, 63 columns...\u001b[0m\n", "\u001b[34m2023-04-26 06:01:42,941 =====================================================\n", " 0% (0 of 8492) | | Elapsed Time: 0:00:00 ETA: --:--:--\u001b[0m\n", "\u001b[34m 36% (3101 of 8492) |####### | Elapsed Time: 0:00:30 ETA: 0:00:52\u001b[0m\n", "\u001b[34m 74% (6314 of 8492) |############## | Elapsed Time: 0:01:00 ETA: 0:00:20\u001b[0m\n", "\u001b[34m100% (8492 of 8492) |####################| Elapsed Time: 0:01:19 Time: 0:01:19\u001b[0m\n", "\u001b[34m2023-04-26 06:03:02,292 getting explanations took 79.35 seconds.\u001b[0m\n", "\u001b[34m2023-04-26 06:03:02,292 ===================================================\u001b[0m\n", "\u001b[34m2023-04-26 06:03:02,293 Falling back to generic labels: label0, label1, ...\u001b[0m\n", "\u001b[34m2023-04-26 06:03:03,569 converting explanations to tabular took 1.28 seconds.\u001b[0m\n", "\u001b[34m2023-04-26 06:03:03,569 ===================================================\u001b[0m\n", "\u001b[34m2023-04-26 06:03:03,573 Wrote baseline used to compute explanations to: /opt/ml/processing/output/explanations_shap/baseline.csv\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,291 Wrote 8492 local explanations to: /opt/ml/processing/output/explanations_shap/out.csv\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,291 writing local explanations took 0.72 seconds.\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,291 ===================================================\u001b[0m\n", "\u001b[34m/usr/local/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3430: FutureWarning: In a future 
version, DataFrame.mean(axis=None) will return a scalar mean over the entire DataFrame. To retain the old behavior, use 'frame.mean(axis=0)' or just 'frame.mean()'\n", " return mean(axis=axis, dtype=dtype, out=out, **kwargs)\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,296 aggregating local explanations took 0.00 seconds.\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,296 ===================================================\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,296 Shap analysis finished.\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,296 Calculated global analysis with predictor\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,298 Stop using endpoint: sm-clarify-clarify-model-xgb-1682488721-365f\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,298 Deleting endpoint configuration with name: sm-clarify-config-1682488721-ed87\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,483 Deleting endpoint with name: sm-clarify-clarify-model-xgb-1682488721-365f\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,619 Model endpoint delivered 104.26802 requests per second and a total of 8494 requests over 81 seconds\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,619 Calculated global analysis without predictor\u001b[0m\n", "\u001b[34m2023-04-26 06:03:04,858 Stop using endpoint: None\u001b[0m\n", "\u001b[34m2023-04-26 06:03:21,818 ['jupyter', 'nbconvert', '--to', 'html', '--output', '/opt/ml/processing/output/report.html', '/opt/ml/processing/output/report.ipynb', '--template', 'sagemaker-xai']\u001b[0m\n", "\u001b[34m[NbConvertApp] Converting notebook /opt/ml/processing/output/report.ipynb to html\u001b[0m\n", "\u001b[34m[NbConvertApp] Writing 494434 bytes to /opt/ml/processing/output/report.html\u001b[0m\n", "\u001b[34m2023-04-26 06:03:22,895 ['wkhtmltopdf', '-q', '--enable-local-file-access', '/opt/ml/processing/output/report.html', '/opt/ml/processing/output/report.pdf']\u001b[0m\n", "\u001b[34m2023-04-26 06:03:23,876 Collected analyses: \u001b[0m\n", "\u001b[34m{\n", " \"version\": \"1.0\",\n", " 
\"explanations\": {\n", " \"kernel_shap\": {\n", " \"label0\": {\n", " \"global_shap_values\": {\n", " \"RAC1P\": 0.010994280314049993,\n", " \"AGEP\": 0.011343420142187187,\n", " \"WKHP\": 0.019087475974209695,\n", " \"PWGTP\": 0.008927441671942444,\n", " \"COW_1.0\": 0.008444419414047668,\n", " \"COW_2.0\": 0.008376523074323839,\n", " \"COW_3.0\": 0.008396849503578146,\n", " \"COW_4.0\": 0.008540521601942217,\n", " \"COW_5.0\": 0.00857102657913103,\n", " \"COW_6.0\": 0.00951128484658168,\n", " \"COW_7.0\": 0.00854089973103888,\n", " \"COW_8.0\": 0.008346755968380127,\n", " \"SCHL_1.0\": 0.00879904379519125,\n", " \"SCHL_2.0\": 0.008494733836321057,\n", " \"SCHL_3.0\": 0.00844835745360474,\n", " \"SCHL_4.0\": 0.008417146158685067,\n", " \"SCHL_5.0\": 0.008448953273449305,\n", " \"SCHL_6.0\": 0.008453068220061696,\n", " \"SCHL_7.0\": 0.008160942280178095,\n", " \"SCHL_8.0\": 0.00863995538256826,\n", " \"SCHL_9.0\": 0.008580852489928993,\n", " \"SCHL_10.0\": 0.008496855400146278,\n", " \"SCHL_11.0\": 0.008697984291176335,\n", " \"SCHL_12.0\": 0.008533599178488446,\n", " \"SCHL_13.0\": 0.008486842509976384,\n", " \"SCHL_14.0\": 0.008411687836710854,\n", " \"SCHL_15.0\": 0.008528468174722343,\n", " \"SCHL_16.0\": 0.009559579139739337,\n", " \"SCHL_17.0\": 0.008531130991807252,\n", " \"SCHL_18.0\": 0.008519639440666535,\n", " \"SCHL_19.0\": 0.008447328794949452,\n", " \"SCHL_20.0\": 0.008366112915286722,\n", " \"SCHL_21.0\": 0.009453603015700147,\n", " \"SCHL_22.0\": 0.03275036939570261,\n", " \"SCHL_23.0\": 0.00856912950796835,\n", " \"SCHL_24.0\": 0.0085400461975767,\n", " \"MAR_1.0\": 0.008918630466031481,\n", " \"MAR_2.0\": 0.008537189382931677,\n", " \"MAR_3.0\": 0.008531051077413095,\n", " \"MAR_4.0\": 0.008633503201661882,\n", " \"MAR_5.0\": 0.00848991253024193,\n", " \"RELP_0.0\": 0.015203430947751962,\n", " \"RELP_1.0\": 0.008916140166072607,\n", " \"RELP_2.0\": 0.008964632867267322,\n", " \"RELP_3.0\": 0.008456441376363678,\n", " \"RELP_4.0\": 
0.008420626087893995,\n", " \"RELP_5.0\": 0.008452286955398066,\n", " \"RELP_6.0\": 0.008476895746459845,\n", " \"RELP_7.0\": 0.00847441291148814,\n", " \"RELP_8.0\": 0.008655640523320596,\n", " \"RELP_9.0\": 0.00831696548109121,\n", " \"RELP_10.0\": 0.008496757951377954,\n", " \"RELP_11.0\": 0.008381340632098968,\n", " \"RELP_12.0\": 0.00849438730870199,\n", " \"RELP_13.0\": 0.008541061519077133,\n", " \"RELP_14.0\": 0.008474709347469715,\n", " \"RELP_15.0\": 0.008671917901277498,\n", " \"RELP_16.0\": 0.008493318879756559,\n", " \"RELP_17.0\": 0.008592975591119463,\n", " \"SEX_2.0\": 0.010532953246517304,\n", " \"SCH_1.0\": 0.00937221399479256,\n", " \"SCH_2.0\": 0.008609507268384031,\n", " \"SCH_3.0\": 0.14263266193518473\n", " },\n", " \"expected_value\": 0.8636011481285095\n", " }\n", " }\n", " }\u001b[0m\n", "\u001b[34m}\u001b[0m\n", "\u001b[34m2023-04-26 06:03:23,876 exit_message: Completed: SageMaker XAI Analyzer ran successfully\u001b[0m\n", "\u001b[34m--!\u001b[0m\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Run the explainability job\n", "clarify_processor.run_explainability(\n", " data_config=explainability_data_config,\n", " model_config=model_config,\n", " explainability_config=shap_config,\n", ")" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Example number: 111 \n", "with model prediction: False\n", "\n", "Feature values -- Label 0 0.207792\n", "1 0.397959\n", "2 0.120253\n", "3 1.000000\n", "4 0.000000\n", " ... 
\n", "57 0.000000\n", "58 0.000000\n", "59 1.000000\n", "60 0.000000\n", "61 0.000000\n", "Name: 111, Length: 62, dtype: float64\n" ] } ], "source": [ "# Load the local SHAP explanations written by the explainability job\n", "local_explanations_out = pd.read_csv(\n", " explainability_report_output_path + \"/explanations_shap/out.csv\"\n", ")\n", "\n", "# Strip the \"_label0\" suffix from the column names to recover the feature names\n", "feature_names = [\n", " str.replace(c, \"_label0\", \"\") for c in local_explanations_out.columns.to_series()\n", "]\n", "local_explanations_out.columns = feature_names\n", "\n", "# Show selected example\n", "selected_example = 111\n", "print(\n", " \"Example number:\",\n", " selected_example,\n", " \"\\nwith model prediction:\",\n", " sum(local_explanations_out.iloc[selected_example]) > 0,\n", ")\n", "print(\"\\nFeature values -- Label\", raw_train.iloc[selected_example])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Amazon Model Monitor\n", "(Go to top)\n", "\n", "Many of the steps above can be reused directly to set up monitoring. For example, with the existing model and data configuration we can run a baseline job." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.\n", "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n" ] } ], "source": [ "model_bias_monitor = model_monitor.ModelBiasMonitor(\n", " role=get_execution_role(),\n", " sagemaker_session=Session(),\n", " max_runtime_in_seconds=1800,\n", " base_job_name=\"bias-model-monitoring\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method `suggest_baseline` suggests a set of baseline constraints and generates descriptive statistics that characterize the data in a training dataset stored in Amazon S3. 
It starts a Processing Job using a Model Monitor container provided by Amazon SageMaker to generate the constraints." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.\n", "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n", "INFO:sagemaker.clarify:Analysis Config: {'dataset_type': 'text/csv', 'headers': ['>50k', 'RAC1P', 'AGEP', 'WKHP', 'PWGTP', 'COW_1.0', 'COW_2.0', 'COW_3.0', 'COW_4.0', 'COW_5.0', 'COW_6.0', 'COW_7.0', 'COW_8.0', 'SCHL_1.0', 'SCHL_2.0', 'SCHL_3.0', 'SCHL_4.0', 'SCHL_5.0', 'SCHL_6.0', 'SCHL_7.0', 'SCHL_8.0', 'SCHL_9.0', 'SCHL_10.0', 'SCHL_11.0', 'SCHL_12.0', 'SCHL_13.0', 'SCHL_14.0', 'SCHL_15.0', 'SCHL_16.0', 'SCHL_17.0', 'SCHL_18.0', 'SCHL_19.0', 'SCHL_20.0', 'SCHL_21.0', 'SCHL_22.0', 'SCHL_23.0', 'SCHL_24.0', 'MAR_1.0', 'MAR_2.0', 'MAR_3.0', 'MAR_4.0', 'MAR_5.0', 'RELP_0.0', 'RELP_1.0', 'RELP_2.0', 'RELP_3.0', 'RELP_4.0', 'RELP_5.0', 'RELP_6.0', 'RELP_7.0', 'RELP_8.0', 'RELP_9.0', 'RELP_10.0', 'RELP_11.0', 'RELP_12.0', 'RELP_13.0', 'RELP_14.0', 'RELP_15.0', 'RELP_16.0', 'RELP_17.0', 'SEX_2.0', 'SCH_1.0', 'SCH_2.0', 'SCH_3.0'], 'label': '>50k', 'label_values_or_threshold': [1], 'facet': [{'name_or_index': 'RAC1P', 'value_or_threshold': [6]}], 'methods': {'report': {'name': 'report', 'title': 'Analysis Report'}, 'pre_training_bias': {'methods': 'all'}, 'post_training_bias': {'methods': 'all'}}, 'predictor': {'model_name': 'clarify-model-xgb', 'instance_type': 'ml.m5.xlarge', 'initial_instance_count': 1, 'accept_type': 'text/csv', 'content_type': 'text/csv'}, 'probability_threshold': 0.6}\n", "INFO:sagemaker:Creating processing-job with name bias-model-monitoring-2023-04-26-06-04-17-328\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ModelBiasMonitor baselining job: bias-model-monitoring-2023-04-26-06-04-17-328\n" ] } ], "source": [ 
"model_bias_monitor.suggest_baseline(\n", " model_config=model_config,\n", " data_config=bias_data_config,\n", " bias_config=bias_config,\n", " model_predicted_label_config=predictions_config,\n", ")\n", "\n", "print(\n", " f\"ModelBiasMonitor baselining job: {model_bias_monitor.latest_baselining_job_name}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides captured endpoint data, a model bias monitoring execution also requires ground truth data. In real use cases, ground truth data should be collected regularly and uploaded to a designated S3 location. In this example notebook, we assume that the training data acts as ground truth." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "ground_truth_data = train_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have a baseline and ground truth data, you can call the `create_monitoring_schedule()` method to set up a monitoring schedule that analyzes the data hourly. 
" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.model_monitor.clarify_model_monitoring:Uploading analysis config to {s3_uri}.\n", "INFO:sagemaker.model_monitor.model_monitoring:Creating Monitoring Schedule with name: bias-model-monitoring-2023-04-26-06-04-17-704\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Model bias monitoring schedule: bias-model-monitoring-2023-04-26-06-04-17-704\n" ] } ], "source": [ "# Set up path for where monitor output should be stored\n", "model_monitor_output_path = \"s3://{}/{}/clarify-monitor\".format(\n", " bucket, prfx\n", ")\n", "\n", "model_bias_analysis_config = None\n", "if not model_bias_monitor.latest_baselining_job:\n", " model_bias_analysis_config = model_monitor.BiasAnalysisConfig(\n", " bias_data_config,\n", " label=model_target,\n", " headers=[model_target] + sensitive_feature + numerical_features + cat_ft_names,\n", " )\n", "\n", "model_bias_monitor.create_monitoring_schedule(\n", " analysis_config=model_bias_analysis_config,\n", " output_s3_uri=model_monitor_output_path,\n", " ground_truth_input=ground_truth_data, \n", " endpoint_input=endpointname,\n", " schedule_cron_expression=model_monitor.CronExpressionGenerator.hourly(),\n", ")\n", "print(f\"Model bias monitoring schedule: {model_bias_monitor.monitoring_schedule_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean Up\n", "Finally, don't forget to clean up the resources we set up and used for this notebook!" 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Stopping Monitoring Schedule with name: bias-model-monitoring-2023-04-26-06-04-17-704\n" ] } ], "source": [ "try:\n", " model_bias_monitor.stop_monitoring_schedule()\n", " model_bias_monitor.delete_monitoring_schedule()\n", " Session().delete_endpoint(endpointname)\n", " Session().delete_endpoint_config(endpointname)\n", " Session().delete_model(modelname)\n", "except Exception as exc:\n", " # Report the failure instead of hiding it, so leftover resources are noticed\n", " print(f\"Clean-up did not complete: {exc}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the end of the notebook." ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }