{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fairness and Explainability with SageMaker Clarify - Bring Your Own Container"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Runtime\n",
    "\n",
    "This notebook takes approximately 30 minutes to run.\n",
    "\n",
    "## Contents\n",
    "1. [Overview](#Overview)\n",
    "1. [Prerequisites and Data](#Prerequisites-and-Data)\n",
    "    1. [Initialize SageMaker](#Initialize-SageMaker)\n",
    "    1. [Download data](#Download-data)\n",
    "    1. [Loading the data: Adult Dataset](#Loading-the-data:-Adult-Dataset)\n",
    "    1. [Data inspection](#Data-inspection)\n",
    "    1. [Encode and Upload the Dataset](#Encode-and-Upload-the-Dataset)\n",
    "    1. [Samples for Inference](#Samples-for-Inference)\n",
    "1. [Build Container](#Build-Container)\n",
    "    1. [Container Source Code](#Container-Source-Code)\n",
    "        1. [The Dockerfile](#The-Dockerfile)\n",
    "        1. [The train Script](#The-train-Script)\n",
    "        1. [The serve Script](#The-serve-Script)\n",
    "    1. [Local Debugging](#Local-Debugging)\n",
    "    1. [Build and Push](#Build-and-Push)\n",
    "1. [Train Model](#Train-Model)\n",
    "    1. [Train](#Train)\n",
    "    1. [Deploy](#Deploy)\n",
    "    1. [Verification](#Verification)\n",
    "1. [Amazon SageMaker Clarify](#Amazon-SageMaker-Clarify)\n",
    "    1. [Detecting Bias](#Detecting-Bias)\n",
    "        1. [Writing DataConfig](#Writing-DataConfig)\n",
    "        1. [Writing ModelConfig](#Writing-ModelConfig)\n",
    "        1. [Writing BiasConfig](#Writing-BiasConfig)\n",
    "        1. [Writing ModelPredictedLabelConfig](#Writing-ModelPredictedLabelConfig)\n",
    "        1. [Pre-training Bias](#Pre-training-Bias)\n",
    "        1. [Post-training Bias](#Post-training-Bias)\n",
    "        1. [Viewing the Bias Report](#Viewing-the-Bias-Report)\n",
    "    1. [Explaining Predictions](#Explaining-Predictions)\n",
    "        1. [Viewing the Explainability Report](#Viewing-the-Explainability-Report)\n",
    "1. [Clean Up](#Clean-Up)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "Amazon SageMaker Clarify helps improve your machine learning models by detecting potential bias and helping explain how these models make predictions. The fairness and explainability functionality provided by SageMaker Clarify takes a step towards enabling AWS customers to build trustworthy and understandable machine learning models. The product comes with the tools to help you with the following tasks.\n",
    "\n",
    "* Measure biases that can occur during each stage of the ML lifecycle (data collection, model training and tuning, and monitoring of ML models deployed for inference).\n",
    "* Generate model governance reports targeting risk and compliance teams and external regulators.\n",
    "* Provide explanations of the data, models, and monitoring used to assess predictions.\n",
    "\n",
    "To compute post-training bias metrics and explainability, SageMaker Clarify needs to get inferences from the SageMaker model named by the `model_name` parameter of the Clarify [analysis configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-configure-processing-jobs.html#clarify-processing-job-configure-analysis) (or the same parameter of [ModelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html?highlight=Processor#sagemaker.clarify.ModelConfig) if you use the [SageMakerClarifyProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor) API). To accomplish this, the Clarify job creates an ephemeral endpoint with the model, known as a shadow endpoint. The model and the Clarify job must follow certain contracts so that they can work together.\n",
    "\n",
    "This sample notebook introduces key terms and concepts needed to understand SageMaker Clarify and walks you through an end-to-end data science workflow. It demonstrates how to **build your own model and container that can work seamlessly with your Clarify jobs**, use the model and SageMaker Clarify to measure bias, explain the importance of the various input features to the model's decisions, and then access the reports through SageMaker Studio, if you have an instance set up."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize SageMaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import json\n",
    "import os\n",
    "import sagemaker\n",
    "import boto3\n",
    "from datetime import datetime\n",
    "\n",
    "session = sagemaker.Session()\n",
    "bucket = session.default_bucket()\n",
    "prefix = \"sagemaker/DEMO-sagemaker-clarify-byoc\"\n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "account_id = role.split(\":\")[4]\n",
    "region = session.boto_region_name\n",
    "if region.startswith(\"cn-\"):\n",
    "    uri_suffix = \"amazonaws.com.cn\"\n",
    "    arn_partition = \"aws-cn\"\n",
    "else:\n",
    "    uri_suffix = \"amazonaws.com\"\n",
    "    arn_partition = \"aws\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download data\n",
    "Data Source: [https://archive.ics.uci.edu/ml/machine-learning-databases/adult/](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)\n",
    "\n",
    "Let's __download__ the data, which comes from the UCI repository$^{[2]}$, and save it in the local folder as `adult.data` and `adult.test`.\n",
    "\n",
    "$^{[2]}$Dua Dheeru, and Efi Karra Taniskidou. \"[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml)\". Irvine, CA: University of California, School of Information and Computer Science (2017)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adult_columns = [\n",
    "    \"Age\",\n",
    "    \"Workclass\",\n",
    "    \"fnlwgt\",\n",
    "    \"Education\",\n",
    "    \"Education-Num\",\n",
    "    \"Marital Status\",\n",
    "    \"Occupation\",\n",
    "    \"Relationship\",\n",
    "    \"Ethnic group\",\n",
    "    \"Sex\",\n",
    "    \"Capital Gain\",\n",
    "    \"Capital Loss\",\n",
    "    \"Hours per week\",\n",
    "    \"Country\",\n",
    "    \"Target\",\n",
    "]\n",
    "\n",
    "s3 = boto3.client(\"s3\")\n",
    "s3.download_file(\n",
    "    f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/uci_adult/adult.data\", \"adult.data\"\n",
    ")\n",
    "s3.download_file(\n",
    "    f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/uci_adult/adult.test\", \"adult.test\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the data: Adult Dataset\n",
    "From the UCI repository of machine learning datasets, this database contains 14 features concerning demographic characteristics of 48,842 rows (32,561 for training and 16,281 for testing); 45,222 rows remain once rows with missing values are dropped, as the loading code below does. The task is to predict whether a person has a yearly income that is more or less than $50,000.\n",
    "\n",
    "Here are the features and their possible values:\n",
    "\n",
    "1. **Age**: continuous.\n",
    "1. **Workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
    "1. **Fnlwgt**: continuous (the number of people the census takers believe that observation represents).\n",
    "1. **Education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
    "1. **Education-num**: continuous.\n",
    "1. **Marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
    "1. **Occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
    "1. **Relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
    "1. **Ethnic group**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
    "1. **Sex**: Female, Male.\n",
    "    * **Note**: this data is extracted from the 1994 Census and enforces a binary option on Sex.\n",
    "1. **Capital-gain**: continuous.\n",
    "1. **Capital-loss**: continuous.\n",
    "1. **Hours-per-week**: continuous.\n",
    "1. **Native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n",
    "\n",
    "Next, we specify our binary prediction task:  \n",
    "15. **Target**: `<=50K`, `>50K` (yearly income in US dollars)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_data = pd.read_csv(\n",
    "    \"adult.data\", names=adult_columns, sep=r\"\\s*,\\s*\", engine=\"python\", na_values=\"?\"\n",
    ").dropna()\n",
    "\n",
    "testing_data = pd.read_csv(\n",
    "    \"adult.test\", names=adult_columns, sep=r\"\\s*,\\s*\", engine=\"python\", na_values=\"?\", skiprows=1\n",
    ").dropna()\n",
    "\n",
    "training_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data inspection\n",
    "Plotting histograms for the distribution of the different features is a good way to visualize the data. Let's plot a few of the features that can be considered _sensitive_.  \n",
    "Let's take a look specifically at the Sex feature of a census respondent. The first plot shows that there are fewer Female respondents overall, and the second shows that the gap is wider among the positive outcomes, where they form ~$\\frac{1}{7}$th of respondents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "training_data[\"Sex\"].value_counts().sort_values().plot(kind=\"bar\", title=\"Counts of Sex\", rot=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_data[\"Sex\"].where(training_data[\"Target\"] == \">50K\").value_counts().sort_values().plot(\n",
    "    kind=\"bar\", title=\"Counts of Sex earning >$50K\", rot=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Encode and Upload the Dataset\n",
    "Here we encode the training and test data. Encoding input data is not necessary for SageMaker Clarify, but is necessary for the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import preprocessing\n",
    "\n",
    "\n",
    "def number_encode_features(df):\n",
    "    result = df.copy()\n",
    "    encoders = {}\n",
    "    for column in result.columns:\n",
    "        if result.dtypes[column] == object:  # np.object was removed in NumPy 1.24\n",
    "            encoders[column] = preprocessing.LabelEncoder()\n",
    "            #  print('Column:', column, result[column])\n",
    "            result[column] = encoders[column].fit_transform(result[column].fillna(\"None\"))\n",
    "    return result, encoders\n",
    "\n",
    "\n",
    "training_data = pd.concat([training_data[\"Target\"], training_data.drop([\"Target\"], axis=1)], axis=1)\n",
    "training_data, _ = number_encode_features(training_data)\n",
    "training_data.to_csv(\"train_data.csv\", index=False, header=False)\n",
    "\n",
    "testing_data, _ = number_encode_features(testing_data)\n",
    "test_features = testing_data.drop([\"Target\"], axis=1)\n",
    "test_target = testing_data[\"Target\"]\n",
    "test_features.to_csv(\"test_features.csv\", index=False, header=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick note about our encoding: the \"Female\" Sex value has been encoded as 0 and \"Male\" as 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, let's upload the data to S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.s3 import S3Uploader\n",
    "from sagemaker.inputs import TrainingInput\n",
    "\n",
    "train_uri = S3Uploader.upload(\"train_data.csv\", \"s3://{}/{}\".format(bucket, prefix))\n",
    "train_input = TrainingInput(train_uri, content_type=\"csv\")\n",
    "test_uri = S3Uploader.upload(\"test_features.csv\", \"s3://{}/{}\".format(bucket, prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Samples for Inference\n",
    "\n",
    "Pick a few samples from the test dataset. They will be used later to test real-time inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use position-based indexing: dropna() above may have left gaps in the index labels\n",
    "sample = test_features.iloc[0, :].values.tolist()\n",
    "samples = test_features.iloc[0:5, :].values.tolist()\n",
    "\n",
    "\n",
    "def convert_to_csv_payload(samples):\n",
    "    return \"\\n\".join([\",\".join([str(feature) for feature in sample]) for sample in samples])\n",
    "\n",
    "\n",
    "def convert_to_jsonlines_payload(samples):\n",
    "    return \"\\n\".join(\n",
    "        [json.dumps({\"features\": sample}, separators=(\",\", \":\")) for sample in samples]\n",
    "    )\n",
    "\n",
    "\n",
    "command_parameters = [\n",
    "    [\"text/csv\", convert_to_csv_payload([sample])],\n",
    "    [\"text/csv\", convert_to_csv_payload(samples)],  # for batch request\n",
    "    [\"application/jsonlines\", convert_to_jsonlines_payload([sample])],\n",
    "    [\"application/jsonlines\", convert_to_jsonlines_payload(samples)],  # for batch request\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build Container\n",
    "\n",
    "This section shows how to build your custom container. For simplicity, a single container is built to serve two purposes: a SageMaker Training job can use it to train your custom model, and the SageMaker Hosting service can deploy it for real-time inference."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Container Source Code\n",
    "\n",
    "There are three source files in the container subfolder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Dockerfile\n",
    "\n",
    "The Dockerfile describes the image that you want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations.\n",
    "\n",
    "The following Dockerfile starts from a [miniconda3 image](https://hub.docker.com/r/continuumio/miniconda3) and runs the normal tools to install `scikit-learn` and `pandas` for data science operations, and `flask` for building a simple web application to serve real-time inference. Then it adds the code that implements the training algorithm and the real-time inference logic, and informs Docker that the container listens on the specified network ports at runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo\n",
    "!cat container/Dockerfile | sed 's/^/    /'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The train Script\n",
    "\n",
    "The `train` script implements the training algorithm. It is packaged into a Docker image that is pushed to ECR (Elastic Container Registry) under your account. When you start a SageMaker training job, the SageMaker instance you requested pulls that image from your ECR and runs it with the data you specified in an S3 URI.\n",
    "\n",
    "It is important to know how SageMaker runs your image. For a training job, SageMaker runs your image like this:\n",
    "\n",
    "    docker run <image> train\n",
    "\n",
    "This is why your image needs to have the executable `train` to start the model training process. See [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html) for more explanations on how Amazon SageMaker interacts with a Docker container that runs your custom training algorithm.\n",
    "\n",
    "The script performs the following steps in sequence:\n",
    "\n",
    "* Parses command line parameters. In a training job environment, SageMaker downloads the data files and saves them under the local directory `/opt/ml/input/data`. For example, if the training dataset channel specified to the fit() method on the client side is `train`, then the training dataset is saved to the folder `/opt/ml/input/data/train`. The model output directory is always `/opt/ml/model`.\n",
    "* Loads the training dataset. The script assumes that the data files are in CSV format and that the first column is the label column.\n",
    "* Trains a [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) estimator.\n",
    "* Dumps the fitted estimator to a model file.\n",
    "\n",
    "The script is built from scratch for demonstration purposes, so it has to take care of many details. For example, if you want to get [hyperparameters specified on client side](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.set_hyperparameters), then the script should be updated to read them from `/opt/ml/input/config/hyperparameters.json`. One way to avoid these details and focus on the algorithm is to integrate the [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) into your image. The toolkit gives you tools to create SageMaker-compatible Docker containers, and has additional tools for letting you create Frameworks (SageMaker-compatible Docker containers that can run arbitrary Python or shell scripts)."
   ]
  },
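  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the hyperparameter handling mentioned above, the sketch below reads `/opt/ml/input/config/hyperparameters.json` and falls back to defaults when the file is absent (for example, during local debugging). It is not part of `container/train`; the helper name `load_hyperparameters` and the `max_iter` and `C` defaults are hypothetical. SageMaker writes every hyperparameter value as a string, so the sketch casts each value to the type of its default.\n",
    "\n",
    "```python\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "# Location where SageMaker places hyperparameters inside a training container.\n",
    "HYPERPARAMETERS_PATH = \"/opt/ml/input/config/hyperparameters.json\"\n",
    "\n",
    "\n",
    "def load_hyperparameters(path=HYPERPARAMETERS_PATH, defaults=None):\n",
    "    # Hypothetical helper: start from the defaults, then apply overrides from the file.\n",
    "    params = dict(defaults or {})\n",
    "    file = Path(path)\n",
    "    if file.exists():\n",
    "        for name, value in json.loads(file.read_text()).items():\n",
    "            default = params.get(name)\n",
    "            # Values arrive as strings; cast to the default's type when one is known.\n",
    "            params[name] = type(default)(value) if default is not None else value\n",
    "    return params\n",
    "\n",
    "\n",
    "# Illustrative defaults only; these names are assumptions, not options of container/train.\n",
    "hyperparameters = load_hyperparameters(defaults={\"max_iter\": 100, \"C\": 1.0})\n",
    "```\n",
    "\n",
    "Casting by default type works for `int`, `float`, and `str` defaults; boolean values would need explicit parsing, because `bool(\"False\")` is `True`."
   ]
  },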
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo\n",
    "!cat container/train | sed 's/^/    /'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The serve Script\n",
    "\n",
    "The `serve` script implements the real-time inference logic. When SageMaker deploys your image to a real-time inference instance, it runs your image as follows:\n",
    "\n",
    "    docker run <image> serve\n",
    "\n",
    "The script is supposed to set up a web server that responds to `/invocations` and `/ping` on port 8080. See [Use Your Own Inference Code with Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html) for more explanations on how Amazon SageMaker interacts with a Docker container that runs your own inference code for hosting services.\n",
    "\n",
    "The following script uses [flask](https://github.com/pallets/flask) to implement a simple web server:\n",
    "\n",
    "* At container startup, the script initializes an estimator using the model file provided by the client-side deploy() method. The model directory and model file name are the same as in the `train` script.\n",
    "* Once started, the server is ready to serve inference requests. The logic resides in the `predict` method:\n",
    "    * Input validation. The example container supports the same MIME types as the Clarify job does, i.e., `text/csv` and `application/jsonlines`.\n",
    "    * Parse payload. The Clarify job may send **batch requests** to the container for better efficiency, i.e., the payload can have multiple lines, each of which is a sample. So, the method decodes the request payload, splits it into lines, and then loads the lines according to the content type. For JSON Lines content, the method uses the key \"features\" to extract the list of features from a JSON line. The key must be the same as the one defined in your Clarify job analysis configuration `predictor.content_template`. It is a **contract** between the Clarify job and the container. You can change it to something else, like \"attributes\", but remember to update the `predictor.content_template` configuration accordingly.\n",
    "    * Do prediction. The method gets the probability scores instead of binary labels, because scores are better for feature explainability.\n",
    "    * Format output. For a **batch request**, the Clarify job expects the same number of result lines as the number of samples in the request. So, the method encodes each prediction and then joins them with line breaks. For the JSON Lines accept type, the method uses two keys, \"predicted_label\" and \"score\", to indicate the prediction. The keys must be the same as your Clarify job analysis configuration `predictor.label` and `predictor.probability`, and the Clarify job uses them to extract predictions from the container response payload. The keys are **contracts** between the Clarify job and the container. You can change them to something else, but remember to update the analysis configuration accordingly.\n",
    "\n",
    "Similarly, the script is built from scratch for demonstration purposes. In a real project, you can use the [SageMaker Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit), which implements a model serving stack built on [Multi Model Server](https://github.com/awslabs/multi-model-server). It can serve your own models or those you trained on SageMaker using machine learning frameworks with native SageMaker support."
   ]
  },
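  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The payload contract described above can be sketched in a few lines of standalone Python. This is a minimal illustration, not the code in `container/serve`: the function names `parse_payload` and `format_response` are hypothetical, and the CSV response layout (label, then score) is an assumption. Only the JSON Lines keys \"features\", \"predicted_label\" and \"score\" come from the description above.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "\n",
    "def parse_payload(body, content_type):\n",
    "    # One sample per line, so a batch request yields several feature lists.\n",
    "    lines = body.decode(\"utf-8\").splitlines()\n",
    "    if content_type == \"text/csv\":\n",
    "        return [[float(value) for value in line.split(\",\")] for line in lines]\n",
    "    if content_type == \"application/jsonlines\":\n",
    "        # \"features\" must match the key used in predictor.content_template.\n",
    "        return [json.loads(line)[\"features\"] for line in lines]\n",
    "    raise ValueError(f\"Unsupported content type: {content_type}\")\n",
    "\n",
    "\n",
    "def format_response(labels, scores, accept_type):\n",
    "    # Emit exactly one result line per sample, in request order.\n",
    "    pairs = list(zip(labels, scores))\n",
    "    if accept_type == \"text/csv\":\n",
    "        # Assumed CSV layout: predicted label, then score.\n",
    "        return \"\\n\".join(f\"{label},{score}\" for label, score in pairs)\n",
    "    if accept_type == \"application/jsonlines\":\n",
    "        # Keys must match predictor.label and predictor.probability.\n",
    "        return \"\\n\".join(\n",
    "            json.dumps({\"predicted_label\": label, \"score\": score}) for label, score in pairs\n",
    "        )\n",
    "    raise ValueError(f\"Unsupported accept type: {accept_type}\")\n",
    "```\n",
    "\n",
    "Keeping parsing and formatting separate from the model call makes the one-line-per-sample rule easy to check on its own: for any batch, the number of lines returned by `format_response` should equal the number of samples returned by `parse_payload`."
   ]
  },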
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo\n",
    "!cat container/serve | sed 's/^/    /'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Local Debugging\n",
    "\n",
    "This section has some tips for debugging the container code locally. Because image build, push and deployment take time to complete, it is important to first test the container code thoroughly on your local machine. (You can safely skip this step in this exercise, because the container code is already functional.)\n",
    "\n",
    "As an example, you can download the container folder and dataset files to your local machine, set up a Python development environment and install the necessary dependencies (found in the Dockerfile), then import the code into your favorite IDE for editing and debugging."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `train` script can be executed as follows:\n",
    "\n",
    "```\n",
    "python train --train_dir <dataset folder> --model_dir <model folder>\n",
    "```\n",
    "\n",
    "Upon successful execution, the script should generate a model file `model.joblib` in the model folder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then the `serve` script can be executed as follows:\n",
    "\n",
    "```\n",
    "python serve --model_dir <model folder>\n",
    "```\n",
    "\n",
    "Upon successful execution, the script should be listening on local host port `8080` for inference requests. The following cell generates a few curl commands that send inference requests (both CSV and JSON Lines) to the port. You can copy and paste them into your local terminal to hit the port and trigger the inference code. For a single sample request, the command should output only one result, and for a batch request, the command should output the same number of results (lines) as the number of samples in the request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"\\n\")\n",
    "for mime_type, payload in command_parameters:\n",
    "    command = f\"    curl -X POST -H 'Content-Type: {mime_type}' -H 'Accept: {mime_type}' -d ${repr(payload)} http://0.0.0.0:8080/invocations\"\n",
    "    print(command)\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have Docker installed locally, you can build the image like this (the -t option specifies the image repository and tag):\n",
    "\n",
    "```\n",
    "docker build container -t bring-your-own-container:latest\n",
    "```\n",
    "\n",
    "Then run the image for training (the -v option maps a folder of your local machine to the Docker container):\n",
    "\n",
    "```\n",
    "docker run -v /Local/Machine/Folder:/BYOC bring-your-own-container:latest train --train_dir /BYOC/dataset --model_dir /BYOC/model\n",
    "```\n",
    "\n",
    "And then run it for inference (the -p option maps a local machine port to the Docker container):\n",
    "\n",
    "```\n",
    "docker run -v /Local/Machine/Folder:/BYOC -p 8080:8080 bring-your-own-container:latest serve --model_dir /BYOC/model\n",
    "```\n",
    "\n",
    "The Docker image can also be pushed to ECR manually. See [Building your own algorithm container](https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and Push\n",
    "\n",
    "To avoid manual operations in your local development environment, this notebook uses the [SageMaker Docker Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli) to automatically build and push the container to ECR for you. The tool uses ECR and AWS CodeBuild, so the role that executes the tool must have the necessary policies and permissions attached. For simplicity, you can update the SageMaker Execution Role attached to this notebook with the required permissions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "role"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Ensure that the role has the following permissions before you continue!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Add or merge the policy below into the trust relationships of the role:\n",
    "\n",
    "```\n",
    "{\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Principal\": {\n",
    "                \"Service\": [\n",
    "                    \"codebuild.amazonaws.com\"\n",
    "                ]\n",
    "            },\n",
    "            \"Action\": \"sts:AssumeRole\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Add an inline policy to the role (execute the cell below to view the policy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import Template\n",
    "\n",
    "template = Template(\n",
    "    \"\"\"{\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"codebuild:DeleteProject\",\n",
    "                \"codebuild:CreateProject\",\n",
    "                \"codebuild:BatchGetBuilds\",\n",
    "                \"codebuild:StartBuild\"\n",
    "            ],\n",
    "            \"Resource\": \"arn:$partition:codebuild:*:*:project/sagemaker-studio*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": \"logs:CreateLogStream\",\n",
    "            \"Resource\": \"arn:$partition:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"logs:GetLogEvents\",\n",
    "                \"logs:PutLogEvents\"\n",
    "            ],\n",
    "            \"Resource\": \"arn:$partition:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": \"logs:CreateLogGroup\",\n",
    "            \"Resource\": \"*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"ecr:CreateRepository\",\n",
    "                \"ecr:BatchGetImage\",\n",
    "                \"ecr:CompleteLayerUpload\",\n",
    "                \"ecr:DescribeImages\",\n",
    "                \"ecr:DescribeRepositories\",\n",
    "                \"ecr:UploadLayerPart\",\n",
    "                \"ecr:ListImages\",\n",
    "                \"ecr:InitiateLayerUpload\",\n",
    "                \"ecr:BatchCheckLayerAvailability\",\n",
    "                \"ecr:PutImage\"\n",
    "            ],\n",
    "            \"Resource\": \"arn:$partition:ecr:*:*:repository/sagemaker-studio*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": \"ecr:GetAuthorizationToken\",\n",
    "            \"Resource\": \"*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"s3:GetObject\",\n",
    "                \"s3:DeleteObject\",\n",
    "                \"s3:PutObject\"\n",
    "            ],\n",
    "            \"Resource\": \"arn:$partition:s3:::sagemaker-*/*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"s3:CreateBucket\"\n",
    "            ],\n",
    "            \"Resource\": \"arn:$partition:s3:::sagemaker*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": [\n",
    "                \"iam:GetRole\",\n",
    "                \"iam:ListRoles\"\n",
    "            ],\n",
    "            \"Resource\": \"*\"\n",
    "        },\n",
    "        {\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Action\": \"iam:PassRole\",\n",
    "            \"Resource\": \"$execution_role\",\n",
    "            \"Condition\": {\n",
    "                \"StringLikeIfExists\": {\n",
    "                    \"iam:PassedToService\": \"codebuild.amazonaws.com\"\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "    ]\n",
    "}\"\"\"\n",
    ")\n",
    "permissions_policy = template.substitute(\n",
    "    partition=arn_partition, account_id=account_id, execution_role=role\n",
    ")\n",
    "print(permissions_policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the permissions are attached to the role, install the tool by,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install sagemaker-studio-image-build --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now define the ECR repository and tag, note that **the repository name must have the prefix sagemaker-studio** which is covered by above permissions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "byoc_repository = \"sagemaker-studio-byoc\"\n",
    "byoc_tag = \"latest\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then the build and push can be done by a single command, Build step can take about 5 minutes to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sm-docker build container --repository $byoc_repository:$byoc_tag --no-logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command should have pushed the image to below URI,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "byoc_image_uri = \"{}.dkr.ecr.{}.{}/{}:{}\".format(\n",
    "    account_id, region, uri_suffix, byoc_repository, byoc_tag\n",
    ")\n",
    "print(f\"Image URI: {byoc_image_uri}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train\n",
    "\n",
    "Now you have a docker image that includes the logic of your model training, and the training data are available to SageMaker on S3. It is high time to train the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The job takes about 10 minutes to run\n",
    "estimator = sagemaker.estimator.Estimator(\n",
    "    image_uri=byoc_image_uri,\n",
    "    role=role,\n",
    "    instance_count=1,\n",
    "    instance_type=\"ml.m5.xlarge\",\n",
    "    sagemaker_session=session,\n",
    ")\n",
    "estimator.fit({\"train\": train_input}, logs=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trained model should have been uploaded to S3 as,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Model file: {estimator.model_data}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy\n",
    "\n",
    "The model file should be deployed as a SageMaker Model which can be used in Clarify post-training bias analysis and feature explanation. The following code creates the model, and then deploys it to an inference host/endpoint for verification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = \"DEMO-clarify-byoc-model-{}\".format(datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\"))\n",
    "endpoint_name = \"DEMO-clarify-byoc-endpoint-{}\".format(datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = estimator.deploy(\n",
    "    initial_instance_count=1,\n",
    "    instance_type=\"ml.m5.xlarge\",\n",
    "    endpoint_name=endpoint_name,\n",
    "    model_name=model_name,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verification\n",
    "\n",
    "A verification is necessary to make sure that the custom model and container follow the contracts with your Clarify jobs. The [AWS CLI](https://aws.amazon.com/cli/) tool is recommended for the test, it is preinstalled in SageMaker Studio and can be used to invoke the endpoint directly with raw payload, avoid intermediate processing steps in wrapper APIs like the [SageMaker Python SDK Predictor class](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html).\n",
    "\n",
    "The following code generates a few AWS CLI commands to send inference requests to the endpoint, and also executes them in the notebook to get the results. You can copy & paste the commands to a Studio Terminal (File > New > Terminal), or to your local terminal, for execution and double-check the results. You can see, for a single sample request, the command outputs only one result, and for a batch request, the command outputs the same number of results (lines) as the number of samples in the request.\n",
    "\n",
    "Some tips:\n",
    "\n",
    "* If you use AWS CLI v2, then an additional parameter `--cli-binary-format raw-in-base64-out` should be added to the command. See [cli_binary_format](https://docs.aws.amazon.com/credref/latest/refdocs/setting-global-cli_binary_format.html#setting-cli_binary_format-alternatives) for the reason.\n",
    "* To send batch requests, add `$` before the payload (`--body`) string to unescape the line-break character ('\\n')."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import re\n",
    "\n",
    "aws_cli_version = subprocess.run([\"aws\", \"--version\"], capture_output=True, text=True).stdout\n",
    "aws_cli_major_version = re.match(\"aws-cli/(\\d+).+\", aws_cli_version).group(1)\n",
    "\n",
    "if aws_cli_major_version == \"1\":\n",
    "    cli_binary_format = \"\"\n",
    "else:\n",
    "    # https://docs.aws.amazon.com/credref/latest/refdocs/setting-global-cli_binary_format.html\n",
    "    cli_binary_format = \"--cli-binary-format raw-in-base64-out\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import Template\n",
    "\n",
    "for mime_type, payload in command_parameters:\n",
    "    template = Template(\n",
    "        f\"aws sagemaker-runtime invoke-endpoint --endpoint-name {endpoint_name} --content-type {mime_type} --accept {mime_type} --body $payload {cli_binary_format} /dev/stderr 1>/dev/null\"\n",
    "    )\n",
    "    command = template.substitute(payload=f\"${repr(payload)}\")\n",
    "    print(command)\n",
    "    command = template.substitute(payload=f\"'{payload}'\")\n",
    "    output = subprocess.run(command, shell=True, capture_output=True, text=True).stderr\n",
    "    print(output)\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the verification is done, you can delete endpoint, but keep the model for Clarify jobs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Amazon SageMaker Clarify\n",
    "\n",
    "With your model set up, it's time to explore SageMaker Clarify. For a general overview of how SageMaker Clarify processing jobs work, refer [the provided link](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-how-it-works.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import clarify\n",
    "\n",
    "# Initialize a SageMakerClarifyProcessor to compute bias metrics and model explanations.\n",
    "clarify_processor = clarify.SageMakerClarifyProcessor(\n",
    "    role=role, instance_count=1, instance_type=\"ml.m5.xlarge\", sagemaker_session=session\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three scenarios where Clarify handles data types, and they all support both CSV (`text/csv`) and JSON Lines (`application/jsonlines`).\n",
    "\n",
    "* dataset type: the MIME type of the dataset and SHAP baseline.\n",
    "* content type: the MIME type of the shadow endpoint request payload\n",
    "* accept type: the MIME type of the shadow endpoint response payload\n",
    "\n",
    "The Clarify jobs in this notebook always uses CSV for dataset type, but you can choose for the other two. The following code chose JSON Lines for both, but it is fine if you change one of them or both of them to CSV, because CSV and JSON Lines are supported by the customer container as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "content_type = \"application/jsonlines\"  # could be 'text/csv'\n",
    "accept_type = \"application/jsonlines\"  # could be 'text/csv'\n",
    "\n",
    "if content_type == \"text/csv\":\n",
    "    content_template = None\n",
    "else:  # 'application/jsonlines'\n",
    "    content_template = '{\"features\":$features}'\n",
    "\n",
    "probability_threshold = 0.4\n",
    "if accept_type == \"text/csv\":\n",
    "    probability = None\n",
    "else:  # 'application/jsonlines'\n",
    "    probability = \"score\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detecting Bias\n",
    "\n",
    "SageMaker Clarify helps you detect possible [pre-training](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html) and [post-training](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html) biases using a variety of metrics.\n",
    "\n",
    "#### Writing DataConfig\n",
    "A [DataConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.DataConfig) object communicates some basic information about data I/O to SageMaker Clarify. For our example here we provide the below information:\n",
    "\n",
    "* `s3_data_input_path`: S3 URI of the train dataset we uploaded above\n",
    "* `s3_output_path`: S3 URI at which our output report will be uploaded\n",
    "* `label`: Specifies the ground truth label, which is also known as observed label or target attribute. It is used for many bias metrics. In this example, the `Target` column has the ground truth label.\n",
    "* `headers`: The list of column names in the dataset\n",
    "* `dataset_type`: specifies the format of your dataset, for this example as we are using CSV dataset this will be `text/csv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bias_report_output_path = \"s3://{}/{}/clarify-bias\".format(bucket, prefix)\n",
    "bias_data_config = clarify.DataConfig(\n",
    "    s3_data_input_path=train_uri,\n",
    "    s3_output_path=bias_report_output_path,\n",
    "    label=\"Target\",\n",
    "    headers=training_data.columns.to_list(),\n",
    "    dataset_type=\"text/csv\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Writing ModelConfig\n",
    "A [ModelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelConfig) object communicates information about your trained model. To avoid additional traffic to the production models, SageMaker Clarify sets up and tears down a dedicated endpoint when processing. For our example here we provide the below information:\n",
    "\n",
    "* `instance_type` and `instance_count` specify your preferred instance type and instance count used to run your model on during SageMaker Clarify's processing. The testing dataset is small, so a single standard instance is good enough to run this example. If you have a large complex dataset, you may want to use a better instance type to speed up, or add more instances to enable Spark parallelization.\n",
    "* `accept_type` denotes the endpoint response payload format, and `content_type` denotes the payload format of request to the endpoint.\n",
    "* `content_template` is used by SageMaker Clarify to compose the request payload if the content type is JSON Lines. To be more specific, the placeholder `$features` will be replaced by the features list from samples. For example, the first sample of the test dataset is `25,2,226802,1,7,4,6,3,2,1,0,0,40,37`, so the corresponding request payload is `'{\"features\":[25,2,226802,1,7,4,6,3,2,1,0,0,40,37]}'`, which conforms to [SageMaker JSON Lines dense format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#common-in-formats)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_config = clarify.ModelConfig(\n",
    "    model_name=model_name,\n",
    "    instance_type=\"ml.m5.xlarge\",\n",
    "    instance_count=1,\n",
    "    accept_type=accept_type,\n",
    "    content_type=content_type,\n",
    "    content_template=content_template,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Writing ModelPredictedLabelConfig\n",
    "\n",
    "A [ModelPredictedLabelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelPredictedLabelConfig) provides information on the format of your predictions.\n",
    "\n",
    "* `probability` is used by SageMaker Clarify to locate the probability score in endpoint response if the accept type is JSON Lines. In this case, the response payload for a single sample request looks like `'{\"predicted_label\": 0, \"score\": 0.026494730307781475}'`, so SageMaker Clarify can find the score `0.026494730307781475` by JSONPath `'score'`.\n",
    "* `probability_threshold` is used by SageMaker Clarify to convert the probability to binary labels for bias analysis. Prediction above the threshold is interpreted as label value 1 and below or equal as label value 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions_config = clarify.ModelPredictedLabelConfig(\n",
    "    probability=probability, probability_threshold=probability_threshold\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Writing BiasConfig\n",
    "[BiasConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.BiasConfig) contains configuration values for detecting bias using a Clarify container."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bias_config = clarify.BiasConfig(\n",
    "    label_values_or_threshold=[1], facet_name=\"Sex\", facet_values_or_threshold=[0], group_name=\"Age\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our demo we provide the following information in BiasConfig API:\n",
    "\n",
    "* `label_values_or_threshold`: List of label value(s) or threshold to indicate positive outcome used for bias metrics. Here positive outcome is earning >$50,000.\n",
    "* `facet_name`: Sensitive columns of the dataset, \"Sex\" is the category\n",
    "* `facet_values_or_threshold`: values of the sensitive group, \"Female\" respondents are the sensitive group.\n",
    "* `group_name`: This example has selected the \"Age\" column which is used to form subgroups for the measurement of bias metric [Conditional Demographic Disparity (CDD)](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-cddl.html) or [Conditional Demographic Disparity in Predicted Labels (CDDPL)](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-cddpl.html).\n",
    "\n",
    "SageMaker Clarify can handle both categorical and continuous data for `facet: values_or_threshold` and for `label_values_or_threshold`. In this case we are using categorical data. The results will show if the model has a preference for records of one sex over the other."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pre-training Bias\n",
    "Bias can be present in your data before any model training occurs. Inspecting your data for bias before training begins can help detect any data collection gaps, inform your feature engineering, and help you understand what societal biases the data may reflect.\n",
    "\n",
    "Computing pre-training bias metrics does not require a trained model.\n",
    "\n",
    "#### Post-training Bias\n",
    "Computing post-training bias metrics does require a trained model.\n",
    "\n",
    "Unbiased training data (as determined by concepts of fairness measured by bias metric) may still result in biased model predictions after training. Whether this occurs depends on several factors including hyperparameter choices.\n",
    "\n",
    "\n",
    "You can run these options separately with `run_pre_training_bias()` and `run_post_training_bias()` or at the same time with `run_bias()` as shown below. We use following additional parameters for the api call:\n",
    "\n",
    "* `pre_training_methods`: Pre-training bias metrics to be computed. The detailed description of the metrics can be found on [Measure Pre-training Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html). This example sets methods to \"all\" to compute all the pre-training bias metrics.\n",
    "* `post_training_methods`: Post-training bias metrics to be computed. The detailed description of the metrics can be found on [Measure Post-training Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html). This example sets methods to \"all\" to compute all the post-training bias metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The job takes about 10 minutes to run\n",
    "clarify_processor.run_bias(\n",
    "    data_config=bias_data_config,\n",
    "    bias_config=bias_config,\n",
    "    model_config=model_config,\n",
    "    model_predicted_label_config=predictions_config,\n",
    "    pre_training_methods=\"all\",\n",
    "    post_training_methods=\"all\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Viewing the Bias Report\n",
    "In Studio, you can view the results under the experiments tab.\n",
    "\n",
    "<img src=\"./recordings/bias_report.gif\">\n",
    "\n",
    "Each bias metric has detailed explanations with examples that you can explore.\n",
    "\n",
    "<img src=\"./recordings/bias_detail.gif\">\n",
    "\n",
    "You could also summarize the results in a handy table!\n",
    "\n",
    "<img src=\"./recordings/bias_report_chart.gif\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're not a Studio user yet, you can access the bias report in PDF, HTML and ipynb formats in the following S3 bucket:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bias_report_output_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, you can download a copy of the HTML report and view it in-place,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp {bias_report_output_path}/report.html ./bias-report.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "\n",
    "IPython.display.HTML(filename=\"bias-report.html\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explaining Predictions\n",
    "There are expanding business needs and legislative regulations that require explanations of _why_ a model made the decision it did. SageMaker Clarify uses Kernel SHAP to explain the contribution that each input feature makes to the final decision.\n",
    "\n",
    "For run_explainability API call we need similar `DataConfig` and `ModelConfig` objects we defined above. [SHAPConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SHAPConfig) here is the config class for Kernel SHAP algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our demo we pass the following information in `SHAPConfig`:\n",
    "\n",
    "* `baseline`: Kernel SHAP algorithm requires a baseline (also known as background dataset). If not provided, a baseline is calculated automatically by SageMaker Clarify using K-means or K-prototypes in the input dataset. Baseline dataset type shall be the same as dataset_type, and baseline samples shall only include features. By definition, baseline should either be a S3 URI to the baseline dataset file, or an in-place list of samples. In this case we chose the latter, and put the mean of the train dataset to the list. For more details on baseline selection please [refer this documentation](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html).\n",
    "* `num_samples`: Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the generated synthetic dataset to compute the SHAP values. \n",
    "* `agg_method`: Aggregation method for global SHAP values. For our example here we are using `mean_abs` i.e. mean of absolute SHAP values for all instances\n",
    "* `save_local_shap_values`: Indicates whether to save the local SHAP values in the output location. Default is True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "baseline = [training_data.mean().iloc[1:].values.tolist()]\n",
    "shap_config = clarify.SHAPConfig(\n",
    "    baseline=baseline,\n",
    "    num_samples=15,\n",
    "    agg_method=\"mean_abs\",\n",
    "    save_local_shap_values=False,\n",
    ")\n",
    "\n",
    "explainability_output_path = \"s3://{}/{}/clarify-explainability\".format(bucket, prefix)\n",
    "explainability_data_config = clarify.DataConfig(\n",
    "    s3_data_input_path=train_uri,\n",
    "    s3_output_path=explainability_output_path,\n",
    "    label=\"Target\",\n",
    "    headers=training_data.columns.to_list(),\n",
    "    dataset_type=\"text/csv\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The job takes about 10 minutes to run\n",
    "clarify_processor.run_explainability(\n",
    "    data_config=explainability_data_config,\n",
    "    model_config=model_config,\n",
    "    explainability_config=shap_config,\n",
    "    model_scores=probability,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Viewing the Explainability Report\n",
    "As with the bias report, you can view the explainability report in Studio under the experiments tab\n",
    "\n",
    "\n",
    "<img src=\"./recordings/explainability_detail.gif\">\n",
    "\n",
    "The Model Insights tab contains direct links to the report and model insights."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're not a Studio user yet, as with the Bias Report, you can access this report at the following S3 bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "explainability_output_path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, you can download a copy of the HTML report and view it in-place,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp {explainability_output_path}/report.html ./explainability-report.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "\n",
    "IPython.display.HTML(filename=\"explainability-report.html\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** You can run both bias and explainability jobs at the same time with `run_bias_and_explainability()`, refer [API Documentation](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor.run_bias_and_explainability) for more details. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean Up\n",
    "Finally, don't forget to clean up the resources we set up and used for this demo!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.delete_model(model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability_byoc.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}