{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# In this notebook, we use Deep Graph Library (DGL) based classification to identify Fraudulent Medicare providers using data from CMS that has been preprocessed using Data Wrangler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Graph Neural Network using DGL\n", "\n", "Graph Neural Networks work by learning representation for nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task, and the goal of the graph neural network would be to learn how to use information from the topology of the sub-graph for each transaction node to transform the node's features to a representation space where the node can be easily classified as fraud or not.\n", "\n", "Specifically, we will be using a relational graph convolutional neural network model (R-GCN) on a heterogeneous graph since we have nodes and edges of different types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameters\n", "\n", "To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. \n", "\n", "Here we're setting only a few of the hyperparameters, to see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. 
The parameters set below are:\n", "\n", "* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.\n", "* **`edges`** is a regular expression that, when expanded, lists all the filenames for the edgelists.\n", "* **`labels`** is the name of the file that contains the target `node_id`s and their labels.\n", "* **`model`** specifies which graph neural network to use; this should be set to `rgcn`.\n", "\n", "The following hyperparameters can be tuned and adjusted to improve model performance:\n", "* **batch-size** is the number of nodes that are used to compute a single forward pass of the GNN\n", "* **embedding-size** is the size of the embedding dimension for non-target nodes\n", "* **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training\n", "* **n-layers** is the number of GNN layers in the model\n", "* **n-epochs** is the number of training epochs for the model training job\n", "* **optimizer** is the optimization algorithm used for gradient-based parameter updates\n", "* **lr** is the learning rate for parameter updates\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install imblearn\n", "!pip install igraph" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "import pandas as pd\n", "import boto3\n", "import os\n", "import sagemaker\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import io\n", "import sklearn\n", "from math import sqrt\n", "from sagemaker import get_execution_role\n", "from sagemaker import RandomCutForest\n", "from sagemaker.deserializers import JSONDeserializer\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "from sklearn.datasets import dump_svmlight_file \n", "from sklearn.metrics import confusion_matrix\n", "from 
sklearn.model_selection import train_test_split\n", "from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score\n", "from sklearn.metrics import classification_report\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.under_sampling import RandomUnderSampler\n", "from imblearn.pipeline import Pipeline\n", "from collections import Counter\n", "import networkx as nx\n", "from igraph import *\n", "import json\n", "from sagemaker_graph_fraud_detection import config\n", "import sys\n", "from os import path\n", "from sagemaker.s3 import S3Downloader " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', 200)\n", "pd.set_option('display.max_rows', 200)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "prefix = 'fraud-detect-demo/graph'\n", "role = get_execution_role()\n", "s3_client = boto3.client(\"s3\")\n", "sys.path.append('./sagemaker_graph_fraud_detection/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Data Preparation**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by reading in the entire preprocessed Medicare data set prepared for classification." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('features-with-headers-rus.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['Fraud'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For DGL, we need to identify the nodes that are used for training, validation and testing (these are called masks)." ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data_ratio = 0.7\n", "valid_data_ratio = 0.2\n", "n_train = int(data.shape[0]*train_data_ratio)\n", "n_valid = int(data.shape[0]*(train_data_ratio+valid_data_ratio))\n", "valid_ids = data.NPI.values[n_train:n_valid]\n", "test_ids = data.NPI.values[n_valid:]\n", "test_df = data[n_valid:][['NPI','Fraud']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save test and validation masks as files - the remaining nodes will be used as the training mask" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('validation.csv', 'w') as f:\n", " f.writelines(map(lambda x: str(x) + \"\\n\", valid_ids))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('test.csv', 'w') as f:\n", " f.writelines(map(lambda x: str(x) + \"\\n\", test_ids))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload all files needed for training to S3 - these include the edges (relation*.csv), the nodes along with the label (tags.csv), the rest of features of each node (features.csv), and the test and validation ids" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('relation_NPI_drug.csv',bucket,'fraud-detect-demo/graph/relation_NPI_drug.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('relation_NPI_HCPCS.csv',bucket,'fraud-detect-demo/graph/relation_NPI_HCPCS.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('relation_NPI.csv',bucket,'fraud-detect-demo/graph/relation_NPI.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('tags.csv',bucket,'fraud-detect-demo/graph/tags.csv')" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('test.csv',bucket,'fraud-detect-demo/graph/test.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('validation.csv',bucket,'fraud-detect-demo/graph/validation.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file('features.csv',bucket,'fraud-detect-demo/graph/features.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now we can begin the prediction of fraud for the test ids using DGL**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specify the location of the uploaded training data in S3 and the folder to store the model and the output of the predictions" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train_data = 'replace with your input S3 uri'\n", "train_output = 'replace with your output S3 uri'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "specify the various files that need to be processed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processed_files = S3Downloader.list(train_data)\n", "print(\"===== Processed Files =====\")\n", "print('\\n'.join(processed_files))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup the edges to create the graph from the provider relation files and the parameters to train the model " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "edges = \",\".join(map(lambda x: x.split(\"/\")[-1], [file for file in processed_files if \"relation\" in file]))\n", "params = {'nodes' : 'features.csv',\n", " 'edges': 'relation*',\n", " 'labels': 'tags.csv',\n", " 'model': 'rgcn',\n", " 'num-gpus': 1,\n", " 'batch-size': 1024,\n", " 'embedding-size': 1024,\n", " 'n-neighbors': 100,\n", " 'n-layers': 
2,\n", " 'n-epochs': 30,\n", " 'optimizer': 'adam',\n", " 'lr': 1e-2\n", " }\n", "\n", "print(\"Graph will be constructed using the following edgelists:\\n{}\" .format('\\n'.join(edges.split(\",\"))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Fit SageMaker Estimator\n", "\n", "With the hyperparameters defined, we can kick off the training job. We will be using the Deep Graph Library (DGL), with MXNet as the backend deep learning framework, to define and train the graph neural network. Amazon SageMaker makes it do this with the Framework estimators which have the deep learning frameworks already setup. Here, we create a SageMaker MXNet estimator and pass in our model training script, hyperparameters, as well as the number and type of training instances we want.\n", "\n", "We can then `fit` the estimator on the the training data location in S3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameters\n", "\n", "To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network models we will be using, the network architecture and the optimizer and optimization parameters. \n", "\n", "Here we're setting only a few of the hyperparameters, to see all the hyperparameters and their default values, see `dgl-fraud-detection/estimator_fns.py`. 
The parameters set below are:\n", "\n", "* **`nodes`** is the name of the file that contains the `node_id`s of the target nodes and the node features.\n", "* **`edges`** is a regular expression that when expanded lists all the filenames for the edgelists\n", "* **`labels`** is the name of the file tha contains the target `node_id`s and their labels\n", "* **`model`** specify which graph neural network to use, this should be set to `r-gcn`\n", "\n", "The following hyperparameters can be tuned and adjusted to improve model performance\n", "* **batch-size** is the number nodes that are used to compute a single forward pass of the GNN\n", "\n", "* **embedding-size** is the size of the embedding dimension for non target nodes\n", "* **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training\n", "* **n-layers** is the number of GNN layers in the model\n", "* **n-epochs** is the number of training epochs for the model training job\n", "* **optimizer** is the optimization algorithm used for gradient based parameter updates\n", "* **lr** is the learning rate for parameter updates\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.mxnet import MXNet\n", "from time import strftime, gmtime\n", "\n", "estimator = MXNet(\n", " entry_point='train_dgl_mxnet_entry_point.py',\n", " source_dir='sagemaker_graph_fraud_detection/dgl_fraud_detection',\n", " role=role, \n", " instance_count=1, \n", " instance_type='ml.g4dn.xlarge',\n", " framework_version=\"1.6.0\",\n", " py_version='py3',\n", " hyperparameters=params,\n", " output_path=train_output,\n", " code_location=train_output,\n", " sagemaker_session=session,\n", ")\n", "\n", "training_job_name = \"{}-{}\".format('dgl-classification', strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", "print(\n", " f\"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job name started with 
{training_job_name} to monitor training job status and details.\"\n", ")\n", "estimator.fit({'train': train_data}, job_name=training_job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the training is completed, the training instances are automatically stopped and SageMaker stores the trained model and evaluation results (on the test data) to a location in S3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read the prediction output for the test data\n", "The training here follows a transductive setting: the feature columns of the test dataset (but not its target column) are used to construct the graph, so the test nodes are part of the graph during training. At the end of training, the predictions for the test dataset are generated and saved under **train_output** in the S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_output_path = os.path.join(train_output, estimator.latest_training_job.job_name, \"output\")\n", "!mkdir -p output_dgl_job\n", "!aws s3 cp --recursive $test_output_path output_dgl_job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tarfile\n", " \n", "# open file\n", "tar = tarfile.open(os.path.join(\"output_dgl_job\", \"output.tar.gz\"), \"r:gz\")\n", "tar.extractall(\"output_dgl_job\")\n", "tar.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dgl_output = pd.read_csv(os.path.join(\"output_dgl_job\", \"preds.csv\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_test = test_df['Fraud'].values.astype('float32')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_preds = dgl_output.pred.values.astype('float32')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def 
plot_confusion_matrix(y_true, y_predicted):\n", "\n", " cm = confusion_matrix(y_true, y_predicted)\n", " # Get the per-class normalized value for each cell\n", " cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " \n", " # We color each cell according to its normalized value, annotate with exact counts.\n", " ax = sns.heatmap(cm_norm, annot=cm, fmt=\"d\")\n", " ax.set(xticklabels=[\"non-fraud\", \"fraud\"], yticklabels=[\"non-fraud\", \"fraud\"])\n", " ax.set_ylim([0,2])\n", " plt.title('Confusion Matrix')\n", " plt.ylabel('Real Classes')\n", " plt.xlabel('Predicted Classes')\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Balanced accuracy = {:.3f}\".format(balanced_accuracy_score(y_test, y_preds)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_confusion_matrix(y_test, y_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Fit SageMaker Estimator with HPO\n", "In this section we fit the SageMaker Estimator using DGL with HPO." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import (\n", " IntegerParameter,\n", " CategoricalParameter,\n", " ContinuousParameter,\n", " HyperparameterTuner,\n", ")\n", "\n", "# Static hyperparameters we do not tune\n", "hyperparameters = {\n", " 'nodes' : 'features.csv',\n", " 'edges': 'relation*',\n", " 'labels': 'tags.csv',\n", " 'model': 'rgcn',\n", " 'num-gpus': 1,\n", " 'n-layers': 2,\n", " 'optimizer': 'adam',\n", "}\n", "\n", "# Dynamic hyperparameters we want to tune and their searching ranges. 
For demonstration purposes, we keep the search space small and do not tune the model architecture itself (for example, the number of GNN layers).\n", "hyperparameter_ranges = {\n", " 'batch-size': CategoricalParameter([512, 1024, 2048, 10000]),\n", " 'embedding-size': CategoricalParameter([16, 32, 64, 128, 256, 512]),\n", " 'n-neighbors': IntegerParameter(800, 1200),\n", " 'n-epochs': IntegerParameter(10, 17),\n", " 'lr': ContinuousParameter(0.002, 0.1),\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "objective_metric_name = \"Validation F1\"\n", "metric_definitions = [{\"Name\": \"Validation F1\", \"Regex\": \"Validation F1 (\\\\S+)\"}]\n", "objective_type = \"Maximize\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.mxnet import MXNet\n", "\n", "estimator_tuning = MXNet(\n", " entry_point='train_dgl_mxnet_entry_point.py',\n", " source_dir='sagemaker_graph_fraud_detection/dgl_fraud_detection',\n", " role=role, \n", " instance_count=1, \n", " instance_type='ml.g4dn.xlarge',\n", " framework_version=\"1.6.0\",\n", " py_version='py3',\n", " hyperparameters=hyperparameters,\n", " output_path=train_output,\n", " code_location=train_output,\n", " sagemaker_session=session,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "tuning_job_name = \"{}-{}\".format('dgl-classification-tuning', strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", "print(\n", " f\"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job whose name starts with {tuning_job_name} to monitor HPO tuning status and details.\\n\"\n", " f\"Note: you will be unable to successfully run the following cells until the tuning job completes. 
This step may take around 2 hours.\"\n", ")\n", "\n", "tuner = HyperparameterTuner(\n", " estimator_tuning, # using the estimator defined in the previous section\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " metric_definitions,\n", " max_jobs=20,\n", " max_parallel_jobs=2,\n", " objective_type=objective_type,\n", " base_tuning_job_name=tuning_job_name,\n", ")\n", "\n", "start_time = time.time()\n", "\n", "tuner.fit({'train': train_data})\n", "\n", "hpo_training_job_time_duration = time.time() - start_time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "sm_client = boto3.Session().client(\"sagemaker\")\n", "\n", "tuning_job_name = tuner.latest_tuning_job.name\n", "tuning_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name\n", ")\n", "\n", "status = tuning_job_result[\"HyperParameterTuningJobStatus\"]\n", "if status != \"Completed\":\n", " print(\"Reminder: the tuning job has not been completed.\")\n", "\n", "job_count = tuning_job_result[\"TrainingJobStatusCounters\"][\"Completed\"]\n", "print(\"%d training jobs have completed\" % job_count)\n", "\n", "is_minimize = (\n", " tuning_job_result[\"HyperParameterTuningJobConfig\"][\"HyperParameterTuningJobObjective\"][\"Type\"]\n", " == \"Minimize\"\n", ")\n", "objective_name = tuning_job_result[\"HyperParameterTuningJobConfig\"][\n", " \"HyperParameterTuningJobObjective\"\n", "][\"MetricName\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)\n", "\n", "full_df = tuner_analytics.dataframe()\n", "\n", "if len(full_df) > 0:\n", " df = full_df[full_df[\"FinalObjectiveValue\"] > -float(\"inf\")]\n", " if len(df) > 0:\n", " df = 
df.sort_values(\"FinalObjectiveValue\", ascending=False)\n", " print(\"Number of training jobs with valid objective: %d\" % len(df))\n", " print({\"lowest\": min(df[\"FinalObjectiveValue\"]), \"highest\": max(df[\"FinalObjectiveValue\"])})\n", " pd.set_option(\"display.max_colwidth\", -1) # Don't truncate TrainingJobName\n", " else:\n", " print(\"No training jobs have reported valid results yet.\")\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read the prediction output for the test dataset from the best tuning job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "df = df[df[\"TrainingJobStatus\"] == \"Completed\"] # filter out the failed jobs\n", "output_path_best_tuning_job = os.path.join(train_output, df[\"TrainingJobName\"].iloc[0], \"output\")\n", "print(output_path_best_tuning_job)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p output_dgl_best_tuning_job\n", "!aws s3 cp --recursive $output_path_best_tuning_job output_dgl_best_tuning_job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tarfile\n", " \n", "# open file\n", "tar = tarfile.open(os.path.join(\"output_dgl_best_tuning_job\", \"output.tar.gz\"), \"r:gz\")\n", "tar.extractall(\"output_dgl_best_tuning_job\")\n", "tar.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dgl_output = pd.read_csv(os.path.join(\"output_dgl_best_tuning_job\", \"preds.csv\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_preds = dgl_output.pred.values.astype('float32')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_confusion_matrix(y_test, y_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"print(classification_report(\n", " y_test, y_preds, target_names=['non-fraud', 'fraud']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Balanced accuracy = {:.3f}\".format(balanced_accuracy_score(y_test, y_preds)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you are done using this notebook, delete the model artifacts and other resources to avoid any incurring charges.\n", "\n", "**Caution**: You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.\n" ] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }