{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# In this notebook, we use Supervised Machine Learning with classification to identify Fraudulent Medicare providers using data from CMS that has been preprocessed using Data Wrangler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import required libraries (install imblearn using pip if not present)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install imblearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "import pandas as pd\n", "import boto3\n", "import os\n", "import sagemaker\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import io\n", "import sklearn\n", "from math import sqrt\n", "from sagemaker import get_execution_role\n", "from sagemaker import RandomCutForest\n", "from sagemaker.deserializers import JSONDeserializer\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "from sklearn.datasets import dump_svmlight_file \n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n", "from sklearn.metrics import classification_report\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "from imblearn.pipeline import Pipeline\n", "from sklearn.datasets import dump_svmlight_file \n", "from collections import Counter\n", "from sagemaker.s3 import S3Downloader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enable the ability to see all columns and rows of data if the data size is big" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('max_columns', 200)\n", "pd.set_option('max_rows', 200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo'\n", "role = get_execution_role()\n", "s3_client = boto3.client(\"s3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by reading in the entire preprocessed medicare data set prepared for classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzip -d processed_data_classification.csv.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "data = pd.read_csv('processed_data_classification.csv', delimiter=',')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Investigate and process the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check data for any nulls" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.isnull().values.any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for imbalance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['fraudulent_provider'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the majority of data is non-fraudulent. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "To improve the performance of the model, we will need to rebalance the data using sampling techniques designed specifically for imbalanced problems. We use the SMOTE oversampling and RandomUnderSampler techniques from imblearn to do this (http://glemaitre.github.io/imbalanced-learn/api.html)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, split the data into a label vector and a feature matrix; SageMaker does not need column headers when processing csv files" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_columns = data.columns[1:]\n", "label_column = data.columns[0]\n", "\n", "features = data[feature_columns].values.astype('float32')\n", "labels = (data[label_column].values).astype('float32')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will split our dataset into a train and test set to evaluate the performance of our models. Since the data is highly imbalanced, it is important to stratify the split to ensure an even class distribution across the sets." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", "    features, labels, test_size=0.1, stratify=labels)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Apply SMOTE" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The oversampling ratio and the undersampling strategy are very important for improving the performance of the models. We have selected ratios for this dataset based on research from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0225-0. However, experiment with different ratios to see their impact" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "over = SMOTE(sampling_strategy=0.25)\n", "under = RandomUnderSampler(sampling_strategy=1)\n", "steps = [('o', over), ('u', under)]\n", "pipeline = Pipeline(steps=steps)\n", "# transform the dataset\n", "X_smote, y_smote = pipeline.fit_resample(X_train, y_train)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(sorted(Counter(y_smote).items()))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(\n", "    X_smote, y_smote, test_size=0.1, stratify=y_smote)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Training and Prediction - Supervised learning (classification)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We use Amazon SageMaker's built-in XGBoost algorithm for supervised classification " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare Data and Upload to S3" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We first copy the data to an in-memory buffer in libsvm format and then upload it to S3 (XGBoost can take either libsvm or csv files as input)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "buf = io.BytesIO()\n", "\n", "sklearn.datasets.dump_svmlight_file(X_smote_train, y_smote_train, buf)\n", "buf.seek(0);" ] }
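, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional sanity check (a quick sketch), we can peek at the first bytes of the libsvm-formatted buffer; each line is `label index:value ...`" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# getvalue() does not move the buffer position, so the upload below is unaffected\n", "print(buf.getvalue()[:120].decode('utf-8'))" ] }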
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "key = 'fraud-dataset'\n", "subdir = 'base'\n", "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(buf)\n", "\n", "s3_train_data = 's3://{}/{}/train/{}/{}'.format(bucket, prefix, subdir, key)\n", "print('Uploaded training data location: {}'.format(s3_train_data))\n", "\n", "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n", "print('Training artifacts will be uploaded to: {}'.format(output_location))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "buf = io.BytesIO()\n", "sklearn.datasets.dump_svmlight_file(X_smote_validation, y_smote_validation, buf)\n", "buf.seek(0);\n", "\n", "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', subdir, key)).upload_fileobj(buf)\n", "\n", "s3_validation_data = 's3://{}/{}/validation/{}/{}'.format(bucket, prefix, subdir, key)\n", "print('Uploaded validation data location: {}'.format(s3_validation_data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now train using SageMaker's built-in XGBoost algorithm. To specify the XGBoost algorithm, we use a utility function to obtain its URI. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = sagemaker.image_uris.retrieve(\"xgboost\", boto3.Session().region_name, \"latest\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker abstracts training via Estimators. We can pass the classifier and parameters along with hyperparameters to the estimator, and fit the estimator to the data in S3. An important parameter here is `scale_pos_weight` which scales the weights of the positive vs. negative class examples. This is crucial to do in an imbalanced dataset like the one we are using here, otherwise the majority class would dominate the learning.\n", "\n", "The other hyperparameters seen here were based on the results of the Hyperparameter Optimization performed using SageMaker. We describe that technique in the next section of this notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Because the data set is so highly skewed, we set the scale position weight conservatively,\n", "# as sqrt(num_nonfraud/num_fraud).\n", "# Other recommendations for the scale_pos_weight are setting it to (num_nonfraud/num_fraud).\n", "scale_pos_weight = sqrt(np.count_nonzero(y_train==0)/np.count_nonzero(y_train))\n", "hyperparams = {\n", " \"max_depth\":7,\n", " \"subsample\":0.8,\n", " \"num_round\":145,\n", " \"eta\":0.82,\n", " \"gamma\":4,\n", " \"min_child_weight\":41.08,\n", " \"silent\":0,\n", " \"objective\":'binary:logistic',\n", " \"eval_metric\":'auc',\n", " \"scale_pos_weight\": scale_pos_weight\n", "}\n", "\n", "clf = sagemaker.estimator.Estimator(container,\n", " get_execution_role(),\n", " hyperparameters=hyperparams,\n", " instance_count=1, \n", " instance_type='ml.m4.xlarge',\n", " output_path=output_location,\n", " sagemaker_session=session)\n", "\n", "clf.fit({'train': s3_train_data, 'validation': s3_validation_data})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Host Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we deploy the estimator to an endpoint." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csv_serializer = CSVSerializer()\n", "xgb_predictor = clf.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge', \n", " serializer=csv_serializer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have trained the model we can use it to make predictions for the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Because we have a large test set, we call predict on smaller batches\n", "def predict(current_predictor, df, rows=500):\n", " split_array = np.array_split(df, int(df.shape[0] / float(rows) + 1))\n", " predictions = ''\n", " for array in split_array:\n", " predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])\n", "\n", " return np.fromstring(predictions[1:], sep=',')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_preds = predict(xgb_predictor, X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.\n", "\n", "We will use [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we can bring a balance between the metrics again by adjusting our classification threshold (threshold between labeling a point as fraud or not). We can try different thresholds to see if they affect the result of the classification. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate balanced accuracy scores for different threshold values\n", "for thres in np.linspace(0.1, 0.99, num=10):\n", " smote_thres_preds = np.where(raw_preds > thres, 1, 0)\n", " print(\"Threshold: {:.1f}\".format(thres))\n", " print(\"Balanced accuracy = {:.3f}\".format(balanced_accuracy_score(y_test, smote_thres_preds)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the best thresholds from the above\n", "y_preds = np.where(raw_preds > 0.99, 1, 0)\n", "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(y_test, y_preds)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, and per-class precision, recall and f1-score can also provide more information about the model's performance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(y_true, y_predicted):\n", "\n", " cm = confusion_matrix(y_true, y_predicted)\n", " # Get the per-class normalized value for each cell\n", " cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " \n", " # We color each cell according to its normalized value, annotate with exact counts.\n", " ax = sns.heatmap(cm_norm, annot=cm, fmt=\"d\")\n", " ax.set(xticklabels=[\"non-fraud\", \"fraud\"], yticklabels=[\"non-fraud\", \"fraud\"])\n", " ax.set_ylim([0,2])\n", " plt.title('Confusion Matrix')\n", " plt.ylabel('Real Classes')\n", " plt.xlabel('Predicted Classes')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_confusion_matrix(y_test, y_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter Optimization (hpo)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Hyperparameters\" can dramtically affect the performance of the trained models. Picking the best hyperparameter settings is very time consuming. We will use SageMaker hyperparameter tuning to automate the process of finding hyperparameters that can optimize model performance. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameter that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined \"objective metric\", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will give it a budget (max number of training jobs) and it will complete once that many training jobs have been executed." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import (\n", " IntegerParameter,\n", " CategoricalParameter,\n", " ContinuousParameter,\n", " HyperparameterTuner,\n", ")\n", "\n", "\n", "hyperparameter_ranges = {\n", " \"min_child_weight\": ContinuousParameter(1, 120),\n", " \"max_depth\": IntegerParameter(1, 10),\n", " \"num_round\":IntegerParameter(1, 500)\n", "}\n", "\n", "objective_metric_name = \"validation:auc\"\n", "\n", "tuner = HyperparameterTuner(\n", " clf, objective_metric_name, hyperparameter_ranges, max_jobs=20, max_parallel_jobs=3\n", ")\n", "\n", "s3_input_train = s3_train_data\n", "s3_input_validation = s3_validation_data\n", "\n", "tuner.fit({\"train\": s3_input_train, \"validation\": s3_input_validation}, include_cls_metadata=False)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient = boto3.client(service_name='sagemaker')\n", "smclient.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuner.latest_tuning_job.job_name\n", ")['HyperParameterTuningJobStatus']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the hpo job is completed, you can see the best traiing job and update the model with the best hyperparameters available in the SM console and generate predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.best_training_job()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = tuner.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge', \n", " serializer=csv_serializer)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_preds = predict(predictor, X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for thres in np.linspace(0.5, 0.99, num=20):\n", " smote_thres_preds = np.where(raw_preds > thres, 1, 0)\n", " print(\"Threshold: {:.1f}\".format(thres))\n", " print(\"Balanced accuracy = {:.3f}\".format(balanced_accuracy_score(y_test, smote_thres_preds)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# scikit-learn expects 0/1 predictions, so we threshold our raw predictions\n", "y_preds = np.where(raw_preds > 0.95, 1, 0)\n", "print(\"Balanced accuracy = {}\".format(balanced_accuracy_score(y_test, y_preds)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the balanceed accuracy is improved, you can update the model with the best hyperparameters as required" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check the data for Bias" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train, test = train_test_split(data, test_size=0.1, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# converting the facet value that we wil check for bias ('female') from float to int for setting up Sagemaker clarify bias processing\n", "train['female'] = train['female'].astype(int)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.s3 import S3Uploader\n", "\n", 
"train.to_csv(\"train.csv\", index=False, header=False)\n", "\n", "train_uri = S3Uploader.upload(\"train.csv\", \"s3://{}/{}\".format(bucket, prefix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IMPORTANT: Since we are running XGBoost on a large number of columns, it is recommended to use an instance with high memory and increase the number of instances to the extent possible" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import clarify\n", "\n", "clarify_processor = clarify.SageMakerClarifyProcessor(\n", " role=role, instance_count=3, instance_type=\"ml.r5.24xlarge\", sagemaker_session=session\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_report_output_path = \"s3://{}/{}/clarify-bias\".format(bucket, prefix)\n", "bias_data_config = clarify.DataConfig(\n", " s3_data_input_path=train_uri,\n", " s3_output_path=bias_report_output_path,\n", " label=\"fraudulent_provider\",\n", " headers=train.columns.to_list(),\n", " dataset_type=\"text/csv\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update the `model_name` below with the `model_name` of the sagemaker endpoint you deployed. this should be available in the parameters returned from the \n", "`describe_endpoint_config` command in the sagemaker client smclient. To get the endpoint config you need to run the `describe_endpoint` command on the endpoint you deployed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient = boto3.client(service_name='sagemaker')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient.describe_endpoint_config(EndpointConfigName= 'get name from Console')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "IMPORTANT: Use the `ModelName` from the above in the `model_name` below. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "IMPORTANT: Use the `ModelName` from the above in the `model_name` below. Also, since we are running XGBoost on a large number of columns, it is recommended to use an instance type with high memory and to increase the instance count to the extent possible" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_config = clarify.ModelConfig(\n", "    model_name=\"UpdateModelName\",\n", "    instance_type=\"ml.r5.24xlarge\",\n", "    instance_count=3,\n", "    accept_type=\"text/csv\",\n", "    content_type=\"text/csv\",\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.75)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_config = clarify.BiasConfig(\n", "    label_values_or_threshold=[1], facet_name=\"female\", facet_values_or_threshold=[1]\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clarify_processor.run_bias(\n", "    data_config=bias_data_config,\n", "    bias_config=bias_config,\n", "    model_config=model_config,\n", "    model_predicted_label_config=predictions_config,\n", "    pre_training_methods=\"all\",\n", "    post_training_methods=\"all\",\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_report_output_path" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S3Downloader.download(\"{}/report.pdf\".format(bias_report_output_path), \"../Fraud Detection/Bias/XGBoost\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To view the bias metrics, open the downloaded report.pdf; alternatively, you can view the results in Studio under the Experiments tab" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate which features contribute to the model predictions (Explainability)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The number of samples below is critical to the quality of the explainability results. Ideally, you need at least 5 times the number of columns in the dataset to allow enough permutations and combinations." ] }
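, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of that rule of thumb (an assumption based on the guidance above, not a Clarify requirement), we can derive a suggested floor for `num_samples` from the column count:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# at least 5x the number of feature columns (excluding the label column)\n", "num_feature_cols = train.shape[1] - 1\n", "print('Suggested minimum num_samples:', 5 * num_feature_cols)" ] }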
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_features = test.drop([\"fraudulent_provider\"], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap_config = clarify.SHAPConfig(\n", " num_samples=1000,\n", " agg_method=\"mean_abs\",\n", " save_local_shap_values=True,\n", ")\n", "\n", "explainability_output_path = \"s3://{}/{}/clarify-explainability\".format(bucket, prefix)\n", "explainability_data_config = clarify.DataConfig(\n", " s3_data_input_path=train_uri,\n", " s3_output_path=explainability_output_path,\n", " label=\"fraudulent_provider\",\n", " headers=train.columns.to_list(),\n", " dataset_type=\"text/csv\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clarify_processor.run_explainability(\n", " data_config=explainability_data_config,\n", " model_config=model_config,\n", " explainability_config=shap_config,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "explainability_output_path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S3Downloader.download(\"{}/report.pdf\".format(explainability_output_path), \"../Fraud Detection/Exp/XGBoost\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the analysis on explainability, open up the exp_report.pdf. Alternately, you can view bias and explainibility reports in Studio under the experiments tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment to clean up endpoints\n", "# xgb_predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Data Acknowledgements\n", "\n", "The dataset used to demonstrated the fraud detection solution has been collected and analysed from CMS \n", "\n", "https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service\n", "\n" ] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }