# 3. Building an XGBoost model to classify electronic medical records (EMR) using SageMaker

In the previous module of the workshop, we prepared our dataset by extracting medical records with a medical speciality of *`Surgery`* or *`Consultation`*. In this module, we will use this dataset to build a classification model that reads each transcription and assigns it to one of two categories: *`Consultation`* only, or referral to a specialist for *`Surgery`*.

The goal of this experiment is **Next Step Prediction**: predicting the speciality a patient needs based on their recorded conditions. In practice, the model could be used to analyze a medical transcription in real time and recommend a referral to the appropriate specialist.

The input is the EMR as a raw text file containing the doctor's notes about the patient, including the patient's age, complaints described in free text, medical history, and so on. The data is unstructured: sections that appear in one patient's anamnesis may be absent from another's.

The value of the solution lies in helping a doctor decide which diagnostics to order. The patient can save time and money, and the doctor can serve patients more efficiently by skipping unnecessary diagnostics. Moreover, in difficult cases the algorithm may help a doctor reach a diagnosis faster, which in some cases can be extremely valuable, up to saving lives.

In theory, patterns found by the algorithm may also give medical researchers ideas for treating some diseases, based on non-obvious connections between those diseases and certain symptoms.

---

## Contents

1. [Objective](#Objective)
1. [Setup Environment](#Setup-Environment)
1. [Load and Explore the Dataset](#Load-and-Explore-Dataset)
1. [Prepare Dataset for Model Training](#Prepare-Dataset-for-Model-Training)
1. [Understand the Algorithm](#Understand-the-Algorithm)
1. [Train the Model](#Train-the-Model)
1. [Deploy and Evaluate the Model](#Deploy-and-Evaluate-the-Model)
1. [Challenges](#Challenges)
1. [Hyperparameter Optimization](#Hyperparameter-Optimization)
1. [Bonus Activity](#Bonus-Activtity)
1. [Conclusion](#Conclusion)
1. [Clean up resources](#Clean-up-resources)



---

## Objective 

In this section of the workshop, we will walk through an example of using SageMaker to train a machine learning model that classifies medical documents with one of SageMaker's built-in algorithms, XGBoost.
 
**Bonus activity**: The workshop also includes a bonus activity in which you will learn how to use SageMaker's automatic Hyperparameter Optimization (HPO) capabilities to improve the model's performance without needing to tune the hyperparameters manually.

**Note**: Teaching in-depth data science approaches for tabular data is outside the scope of this workshop. However, we recommend that you use this notebook as a reference and adapt it to your needs in future projects.

---
## Setup Environment

Before we begin, we need to set up the notebook environment with the following steps:

- **import** some useful libraries (as in any Python notebook)
- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)
- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services
- **upgrade** the SageMaker SDK to the latest version

While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.

In [None]:
%pip install --upgrade sagemaker

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
pd.set_option('display.max_colwidth', None)

import time
import os

# import self-defined functions
from util.classification_report import generate_classification_report, predict_from_numpy_V2  # helper function for classification reports
from util.Pipeline import extractTextract, extractMedical
from util.preprocess import *

# setting up SageMaker parameters
import pkg_resources
pkg_resources.require("sagemaker>2.9.2") 
import sagemaker
import boto3

import matplotlib.pyplot as plt
import seaborn as sns

boto_session = boto3.Session()
region = boto_session.region_name
sgmk_session = sagemaker.Session()
bucket_name = sgmk_session.default_bucket()
bucket_prefix = "emr-mtSample"  # Location in the bucket to store our files
sgmk_client = boto_session.client("sagemaker")
sgmk_role = sagemaker.get_execution_role()

---
## Load and Explore Dataset

To begin, we will load the dataset produced by our previous notebook [2.Batch_Data_Processing](./2.Batch_Data_Processing.ipynb). Each row is labelled by medical speciality (Surgery or not Surgery) and contains the medical conditions that were extracted from the electronic medical records.
You can find the processed dataset at `./data/processed_combined_wide.csv`.

**Note**: The original raw dataset that this workshop is based on is available at [Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions).

*Demographics:*
* `ID`: ID of the patient (int)
* `Label`: whether the patient needs surgery (bool)
* `nontender`: a medical condition extracted from the doctor's notes; the number indicates the confidence of the symptom (float)
* the remaining columns are other medical conditions
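To make the wide layout concrete, here is a minimal sketch of how a list of extracted entities could be turned into one row with a confidence value per condition. The sample entities are made up, the function name `entities_to_wide_row` is hypothetical, and filling absent conditions with `0.0` is an assumption; the workshop's own preprocessing helpers may work differently.

```python
from typing import Dict, List

# Made-up sample shaped like the "Entities" list of a Comprehend Medical response
sample_entities = [
    {"Text": "nontender", "Category": "MEDICAL_CONDITION", "Score": 0.97},
    {"Text": "wound", "Category": "MEDICAL_CONDITION", "Score": 0.88},
    {"Text": "Wound", "Category": "MEDICAL_CONDITION", "Score": 0.93},
    {"Text": "aspirin", "Category": "MEDICATION", "Score": 0.99},
]


def entities_to_wide_row(entities: List[dict], feature_names: List[str]) -> Dict[str, float]:
    """Build one wide-format row: one column per condition, value = confidence."""
    row = {name: 0.0 for name in feature_names}  # assumption: 0.0 means "not mentioned"
    for entity in entities:
        if entity["Category"] != "MEDICAL_CONDITION":
            continue  # skip medications and other entity categories
        name = entity["Text"].lower()
        if name in row:
            # keep the highest confidence when a condition appears several times
            row[name] = max(row[name], entity["Score"])
    return row


row = entities_to_wide_row(sample_entities, ["nontender", "wound", "hypertension"])
print(row)
```

Each key of `row` corresponds to one feature column of the wide dataset; `ID` and `Label` would be added separately.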

In [None]:
df_wide=pd.read_csv("./data/processed_combined_wide.csv")
df_wide.head()

#### Explore the correlation between the input variables and the output variable

Let us start by looking at the correlation between the input features and our label (Surgery/Not Surgery).


In [None]:
corrPlot(df_wide)

#### Observations
+ Observation 1: *`wound`* and *`hypertension`* are positively and negatively correlated with Surgery, respectively.
+ Observation 2: *`nausea`* and *`vomitting`* are positively correlated with each other.
+ Did you observe any more from the plot?
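As a reminder of what the plot measures, the sketch below computes the Pearson correlation coefficient by hand. `corrPlot` is a workshop helper whose implementation is not shown here, so this only illustrates the underlying quantity; the toy values are made up and are not taken from the dataset.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values: 1 = Surgery, 0 = not Surgery
label = [1, 1, 1, 0, 0, 0]
wound = [0.9, 0.8, 0.95, 0.0, 0.1, 0.0]
hypertension = [0.0, 0.1, 0.0, 0.9, 0.85, 0.7]

r_wound = pearson(wound, label)
r_hyper = pearson(hypertension, label)
print(f"wound vs label:        {r_wound:+.2f}")
print(f"hypertension vs label: {r_hyper:+.2f}")
```

A value close to +1 means the feature tends to be high when the label is 1, and a value close to -1 means the opposite.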

---
## Prepare Dataset for Model Training

Next, we will start training the model. Before we can proceed, we need to:

1. Shuffle and split the data into **Training (80%)**, **Validation (10%)**, and **Test (10%)** sets
2. Convert the data to the format the algorithm expects (e.g. CSV)
3. Upload the data to S3
4. Create `TrainingInput` objects defining the data sources for the SageMaker SDK

The training and validation datasets will be used during the training (and tuning) phase, while the 'holdout' test set will be used afterwards to evaluate the model.

Note that the SageMaker XGBoost algorithm expects training data in the **libSVM** or **CSV** format. For CSV input, it expects:

- The target variable in the first column, and
- No header row

You can find more information about this [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost).
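The split and the CSV layout described above can be sketched with the standard library alone. This is only an illustration on made-up rows; the helper names `split_rows` and `to_xgboost_csv` are hypothetical, and the notebook itself performs these steps with pandas and NumPy.

```python
import csv
import io
import random
from typing import List, Tuple


def split_rows(rows: List[list], seed: int = 123) -> Tuple[List[list], List[list], List[list]]:
    """Shuffle rows, then cut them into 80% train, 10% validation, 10% test."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end, val_end = int(0.8 * n), int(0.9 * n)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


def to_xgboost_csv(rows: List[list]) -> str:
    """Serialize rows as CSV without a header; the label must already be column 0."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up rows in the form [label, feature_1, feature_2]
rows = [[i % 2, round(i * 0.01, 2), round(1 - i * 0.01, 2)] for i in range(100)]
train, validation, test = split_rows(rows)
print(len(train), len(validation), len(test))
print(to_xgboost_csv(train[:2]))
```

Fixing the seed keeps the split reproducible between runs, which makes it easier to compare models trained later in the notebook.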

In [None]:
# Remove the ID column (assumed to be the first column)
df_combined_model = df_wide.iloc[:, 1:]

# Shuffle, then split into train (80%), validation (10%), and test (10%)
train_data, validation_data, test_data = np.split(
    df_combined_model.sample(frac=1, random_state=123),
    [int(0.8 * len(df_combined_model)), int(0.9 * len(df_combined_model))],
)

# Create CSV files for Train / Validation / Test
train_data.to_csv("data/train.csv", index=False, header=False)
validation_data.to_csv("data/validation.csv", index=False, header=False)
# The test set is not uploaded for training, so it keeps its header row
test_data.to_csv("data/test.csv", index=False, header=True)

### Upload dataset to S3

In [None]:
# Upload CSV files to S3 for SageMaker training
train_uri = sgmk_session.upload_data(
    path="data/train.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix
)
val_uri = sgmk_session.upload_data(
    path="data/validation.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix
)


# Create s3_inputs

s3_input_train = sagemaker.TrainingInput(s3_data=train_uri, content_type="csv")
s3_input_validation = sagemaker.TrainingInput(s3_data=val_uri, content_type="csv")

print(f"{s3_input_train.config}\n\n{s3_input_validation.config}")

## Understand the Algorithm
**`XGBoost`** stands for e**X**treme **G**radient **Boosting**. It implements the gradient boosting decision tree algorithm, which is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. Check detailed documentation for built-in *XGBoost* [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).

*"The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost."* -- Tianqi Chen, Creator of *XGBoost*

The two major advantages of using XGBoost are:

1. Fast execution speed: Generally, XGBoost is faster than other implementations of gradient boosting.
2. High model performance: XGBoost has excelled on structured (tabular) datasets in classification and regression predictive modeling problems.
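To make the residual-fitting idea described above concrete, here is a toy gradient boosting loop built from one-split decision stumps. It is a deliberately simplified sketch that uses squared-error loss on made-up one-dimensional data; it is not how XGBoost is implemented, and it omits regularization and the log-loss gradients used by `binary:logistic`. The names `num_round` and `eta` mirror the hyperparameters of the same name used later in this notebook.

```python
from typing import List, Tuple

# One decision stump: (threshold, value if x <= threshold, value if x > threshold)
Stump = Tuple[float, float, float]


def fit_stump(x: List[float], residuals: List[float]) -> Stump:
    """Pick the single split on x that best fits the residuals (squared error)."""
    best: Stump = (x[0], 0.0, 0.0)
    best_error = float("inf")
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if error < best_error:
            best, best_error = (threshold, left_mean, right_mean), error
    return best


def predict(stumps: List[Stump], eta: float, xi: float) -> float:
    """Add up the shrunken outputs of all stumps for one input value."""
    return sum(eta * (lo if xi <= t else hi) for t, lo, hi in stumps)


def boost(x: List[float], y: List[float], num_round: int, eta: float) -> List[Stump]:
    """Each round fits a new stump to the errors left by the current ensemble."""
    stumps: List[Stump] = []
    for _ in range(num_round):
        residuals = [yi - predict(stumps, eta, xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return stumps


def mse(stumps: List[Stump], eta: float, x: List[float], y: List[float]) -> float:
    """Mean squared error of the ensemble on the given data."""
    return sum((yi - predict(stumps, eta, xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Made-up 1-D toy data
x = [float(i) for i in range(10)]
y = [0.0, 0.1, 0.0, 0.2, 0.9, 1.0, 0.8, 1.0, 0.3, 0.2]

for rounds in (1, 5, 20):
    model = boost(x, y, num_round=rounds, eta=0.5)
    print(rounds, round(mse(model, 0.5, x, y), 4))
```

The training error shrinks as rounds are added, because every new stump is fitted to whatever the previous stumps still get wrong. A smaller `eta` makes each step more cautious, which usually calls for more rounds.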


In [None]:
from sagemaker.image_uris import retrieve

# Look up the URI of the built-in XGBoost container image for this region
training_image = retrieve("xgboost", region=region, version="1.0-1")

print(training_image)

### Define Hyperparameters & Algorithm
In the following step, we will use the [sagemaker.estimator.Estimator()](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) class to configure the following:

* image_uri - Training image to use; in this case, the XGBoost training image
* instance_type - Type of instance to use for training.
* instance_count - The number of instances to run the training job on. For algorithms that support distributed training, you can set an instance count of more than 1.
* role - IAM role used to run the training job
* use_spot_instances - Whether to use spot instances. For more information about spot training, refer to the following url: https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html
* max_run - Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time, Amazon SageMaker terminates the job regardless of its current status.
* max_wait - Timeout in seconds for the whole job when using spot instances, covering both the time spent waiting for spot capacity and the training time
* hyperparameters - The hyperparameters used to train the model

For more information about the available XGBoost hyperparameters, please refer to the documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
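Because the wait timeout covers the training time as well, it should not be smaller than the run timeout when spot instances are used. The sketch below is a small local sanity check written under that assumption; `spot_timeout_problems` is a hypothetical helper, not part of the SageMaker SDK, and the service performs its own validation when the job is created.

```python
from typing import List, Optional


def spot_timeout_problems(use_spot_instances: bool, max_run: int, max_wait: Optional[int]) -> List[str]:
    """Return a list of human-readable problems with the timeout settings."""
    problems: List[str] = []
    if not use_spot_instances:
        return problems  # max_wait only matters for spot training
    if max_wait is None:
        problems.append("max_wait should be set when use_spot_instances is True")
    elif max_wait < max_run:
        problems.append(f"max_wait ({max_wait}s) should be >= max_run ({max_run}s)")
    return problems


# 20 minutes of runtime, 30 minutes in total
print(spot_timeout_problems(True, max_run=20 * 60, max_wait=30 * 60))
# Timeouts swapped by mistake
print(spot_timeout_problems(True, max_run=30 * 60, max_wait=20 * 60))
```

Running a check like this before creating the estimator surfaces a configuration mistake without starting a training job.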

In [None]:
hyperparameters = {
    "num_round": "150",     # int: [1,300]
    "max_depth": "5",     # int: [1,10]
    "alpha": "2.5",         # float: [0,5]
    "eta": "0.5",           # float: [0,1]
    "objective": "binary:logistic",
}

# Instantiate an XGBoost estimator object
estimator = sagemaker.estimator.Estimator(
    image_uri=training_image,           # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sgmk_role,                      # IAM role to be used
    use_spot_instances=True,       # Use spot instances to reduce cost
    max_run=20*60,                 # Maximum allowed active runtime
    max_wait=30*60,                # Maximum clock time (including spot delays)
    hyperparameters=hyperparameters
)

---
## Train the Model

To start the training job, we call the `estimator.fit()` function. This starts a SageMaker training job in the background. You can also see your training job in the AWS console by going to SageMaker -> Training jobs.

Once the training job is completed, proceed to the next step.

In [None]:
# start a training (fitting) job
estimator.fit({ "train": s3_input_train, "validation": s3_input_validation })

## Deploy and Evaluate the Model
Now that we've trained our XGBoost model, let us deploy it (host it behind a real-time endpoint) so that we can start running predictions in real time. This can be done using the `estimator.deploy()` function. You can find more information about model deployment here - https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html.

For the input to the deployment function, we will specify the following:

* initial_instance_count - Initial number of EC2 instances to deploy to an endpoint for prediction.
* instance_type - Type of EC2 instance to deploy to an endpoint for prediction, for example, `ml.c4.xlarge`.

This deployment might take up to 8 minutes, and by default the code will wait for the deployment to complete.
If you like, you can instead:
+ Un-comment the `wait=False` parameter
+ Use the Endpoints page of the SageMaker Console to check the status of the deployment
+ Skip over the Evaluation section below (which won't run until the deployment is complete), and start the Hyperparameter Optimization job - which will take a while to run too, so can be started in parallel

In [None]:
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    predictor_cls=sagemaker.predictor.Predictor,
    # serializer=sagemaker.serializers.CSVSerializer(),
    # wait=False,
)

## Run predictions

Once the SageMaker endpoint has been deployed, we can test it by running predictions on our test data and evaluating the results.
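For a binary classifier, each prediction is a probability that still has to be compared against a decision threshold. The sketch below shows that step on a made-up response body. It assumes the endpoint returns probabilities as comma- or newline-separated text; the workshop helper `predict_from_numpy_V2` handles the real request and response, and the function names here are hypothetical.

```python
from typing import List


def parse_probabilities(payload: bytes) -> List[float]:
    """Parse a comma- or newline-separated response body into probabilities."""
    text = payload.decode("utf-8")
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


def to_labels(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Map probabilities to class labels: 1 at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up response body, not real endpoint output
sample_payload = b"0.91,0.12\n0.5"
probabilities = parse_probabilities(sample_payload)
print(probabilities)
print(to_labels(probabilities, threshold=0.5))
```

Raising the threshold makes the model more conservative about predicting the positive class, which trades recall for precision.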

In [None]:
# (This cell will error, until the endpoint deployment above is finished!)

# Get predictions for the test set:
predictions = predict_from_numpy_V2(predictor, test_data.drop(["Label"], axis=1))

print(predictions)

In [None]:
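# Reload the helper module to pick up any edits made to it
from importlib import reload
from util import classification_report
reload(classification_report)
# Re-import so the names refer to the reloaded functions
from util.classification_report import generate_classification_report, predict_from_numpy_V2

predictions = predict_from_numpy_V2(predictor, test_data.drop(["Label"], axis=1))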

In [None]:
# helper function for classification reports

generate_classification_report(
    y_real=test_data['Label'].values, 
    y_predict_proba=predictions, 
    decision_threshold=0.5,
    class_names_list=["Negative", "Positive"],
    title="Initial model",
)

---
## Challenges

### Challenge 2a: Change the Hyperparameters
Now that we've trained our model using some predefined hyperparameters, let us see if we can improve the model's performance by manually changing the hyperparameters. For example, you can change `num_round`, `max_depth`, and the other hyperparameters that we plan to tune.



In [None]:
##!! TO DO: You are expected to fill in the hyperparameters below and run the model training

hyperparameters2 = {
    "num_round": "",  ## TO WRITE   
    "max_depth": "",  ## TO WRITE   
    "alpha": "",      ## TO WRITE  
    "eta": "",        ## TO WRITE   
    "objective": "binary:logistic",
}

# Instantiate an XGBoost estimator object
estimator2 = sagemaker.estimator.Estimator(
    image_uri=training_image,           # XGBoost algorithm container
    instance_type="ml.m5.xlarge",  # type of training instance
    instance_count=1,              # number of instances to be used
    role=sgmk_role,                      # IAM role to be used
    use_spot_instances=True,       # Use spot instances to reduce cost
    max_run=20*60,                 # Maximum allowed active runtime
    max_wait=30*60,                # Maximum clock time (including spot delays)
    hyperparameters=hyperparameters2
)
estimator2.fit({ "train": s3_input_train, "validation": s3_input_validation }) ## model fitting 
estimator2.create_model() ## create a new model from the training job

### Create model from the training job

Once the training job is done, no SageMaker model has been created yet. Check the training jobs and models in your SageMaker console. To create a model from a training job, refer to the documentation for the *[create_model API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model)*.

In [None]:
##!! TO DO: get the model artifact information

## hint 1: create a primary container with the trained model
## hint 2: check the API estimator2.create_model().model_data 

In [None]:
##!! TO DO: create a model from the container

## hint 3: estimator2.create_model().model_data 

### Stuck?
Feel free to check the [video/solution in Step 3](https://www.aiml-loft.wwps.aws.dev/workshops/module-medical-document-processing-and-classification/step3/), or reach out to any of us.

### Challenge 2b: Update the endpoint 

We have now trained two models using predefined hyperparameters and deployed the first one. For the second model, instead of deploying a new endpoint, can you update the existing endpoint with the newly built model?

Hint 1: create a model with API *create_model*: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-deploy-model.html 

The API that you are going to use *update_endpoint*: https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.update_endpoint


In [None]:
%%time

###########################################################################
## !! TO DO: WRITE YOUR CODE HERE TO UPDATE THE END POINT ########################
###########################################################################

In [None]:
predictions = predict_from_numpy_V2(predictor, test_data.drop(["Label"], axis=1))
generate_classification_report(
    y_real=test_data['Label'].values,
    y_predict_proba=predictions,
    decision_threshold=0.5,
    class_names_list=["Negative", "Positive"],
    title="updated model",
)

---
## Hyperparameter Optimization

Now that we've trained our model using some predefined hyperparameters, let us see if we can improve the model's performance with SageMaker Hyperparameter Optimization (HPO), which automates the search for optimal hyperparameters. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each hyperparameter that we plan to tune.

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. Smaller `max_parallel_jobs` will increase the chance of finding optimal values, but HPO will take more time to finish.
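One rough way to reason about this trade-off is to count how many sequential batches of jobs a setting allows, since the search can only learn from jobs that have already finished. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes jobs run in strict batches, which is a simplification, because in practice jobs need not start and finish in lockstep.

```python
import math


def sequential_rounds(max_jobs: int, max_parallel_jobs: int) -> int:
    """Number of back-to-back batches needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


for parallel in (1, 2, 10):
    rounds = sequential_rounds(max_jobs=10, max_parallel_jobs=parallel)
    print(f"max_parallel_jobs={parallel}: about {rounds} sequential batch(es)")
```

With `max_parallel_jobs` equal to `max_jobs`, everything runs in a single batch, so no job can benefit from the results of another.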

Next, we specify the objective metric that we'd like to tune. For a custom algorithm, the metric definition would include the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, its metrics are predefined, including **validation:auc** and **train:auc**, and we chose to monitor *validation:auc*, as you can see below. Because the metric is predefined, we only need to specify its name. For more information on tuning XGBoost, please refer to the documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html).
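As a reminder of what an AUC value expresses, the sketch below computes it directly from its ranking definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The data is made up and the function `roc_auc` is only illustrative; the training job computes the metric itself.

```python
from typing import List


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Because AUC depends only on how examples are ranked, it does not change with the decision threshold, which makes it a convenient objective for comparing tuning jobs.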

Depending on the number of tries, HPO can find a better performing model faster, compared to simply trying different hyperparameters by trial and error or grid search. You can learn more in-depth details about SageMaker HPO [here](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html).

For more information on SageMaker HPO, please refer to the documentation [here](https://sagemaker.readthedocs.io/en/stable/tuner.html).

**Note:** with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(100, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1),
}

# set up the objective metric
objective = "validation:auc"
#objective = "validation:accuracy"
# instantiate a HPO object
tuner = HyperparameterTuner(
    estimator=estimator,              # the SageMaker estimator object
    hyperparameter_ranges=ranges,     # the range of hyperparameters
    max_jobs=10,                      # total number of HPO jobs
    max_parallel_jobs=2,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_metric_name=objective,  # the objective metric to be used for HPO
    objective_type="Maximize",        # maximize or minimize the objective metric
)  


In [None]:
%%time
# start HPO
tuner.fit({ "train": s3_input_train, "validation": s3_input_validation })

In [None]:
%%time
# wait, until HPO is finished
hpo_state = "InProgress"

while hpo_state == "InProgress":
    hpo_state = sgmk_client.describe_hyper_parameter_tuning_job(
                HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)["HyperParameterTuningJobStatus"]
    print("-", end="")
    time.sleep(60)  # poll once every 1 min

print("\nHPO state:", hpo_state)



In [None]:
%%time
# deploy the best model from HPO
hpo_predictor = tuner.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    predictor_cls=sagemaker.predictor.Predictor,
    serializer=sagemaker.serializers.CSVSerializer(),
)

In [None]:
hpo_predictor.deserializer=sagemaker.deserializers.CSVDeserializer()

In [None]:
# getting the predicted probabilities of the best model
hpo_predictions = predict_from_numpy_V2(hpo_predictor, test_data.drop(["Label"], axis=1))
print(hpo_predictions)

# generate report for the best model
generate_classification_report(
    y_real=test_data["Label"].values, 
    y_predict_proba=hpo_predictions, 
    decision_threshold=0.5,
    class_names_list=["Consultation","Surgery"],
    title="Best model (with HPO)",
)

---
##  Bonus Activtity

### A simplified pipeline to process an Electronic Health Record
Here, we will combine Textract, Comprehend Medical, and the SageMaker endpoint to process an electronic medical report.

In [None]:
from importlib import reload
from util.Pipeline import extractTextract, extractMedical

### Step 1: Extract data from Textract

In [None]:
PDFprefix = 'sagemaker/medical_notes'
fileName =  'sample_report_1.pdf'
#fileUploadPath = os.path.join('./data', fileName)
textractObjectName = os.path.join(PDFprefix, 'data', fileName)
print("EHR file to be processed is at ", textractObjectName)

doc=extractTextract(bucket_name,textractObjectName)

### Step 2: Extract data from Comprehend Medical

In [None]:
comprehendResponse=extractMedical(doc)
df_cm=extractMC_v2(comprehendResponse[0])

### Step 3: Organize the extracted JSON into a dataframe

In [None]:
mclist, df_cm2=retrieve_mcList(df_cm, nFeature=20,threshold=0.9)
df_cm2=df_mc_generator_slim(df_cm2)
df_cm2

### Step 4: Prediction with the endpoint

In [None]:
sample_pred = predict_from_numpy_V2(predictor, df_cm2.drop(["ID"], axis=1))
sample_pred_hpo = predict_from_numpy_V2(hpo_predictor, df_cm2.drop(["ID"], axis=1))
print(
    "Predicted probability of surgery for the patient: "
    f"{round(sample_pred[0], 2)} from model 1, "
    f"{round(sample_pred_hpo[0], 2)} from model 2 (after HPO)"
)

---
## Conclusion
SageMaker built-in algorithms are great for getting a first model fast, and combining them with SageMaker HPO can really boost their accuracy. The best way to succeed with a built-in algorithm is to **read the [algorithm's doc pages](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) carefully** to understand what data format and parameters it needs!

In our run, we used the built-in XGBoost algorithm to train a classification model based on doctors' transcriptions. The first model, with manually chosen hyperparameters, showed an AUC of ~0.945, and the model optimized with HPO showed an AUC of ~0.951, a slight improvement.


After that, we demonstrated a simple pipeline that processes a patient's electronic record with the endpoint.

---
## Clean up resources
### Delete the endpoint and configuration if needed

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
hpo_predictor.delete_endpoint(delete_endpoint_config=True)

### Delete the generated files in the S3 bucket
+ check your [S3 bucket](https://s3.console.aws.amazon.com/s3/home?region=ap-southeast-1) for the content information
+ delete all the saved models and training jobs in the bucket, under folder */emr-mtSample*
+ delete the pdf files in S3 bucket under folder */sagemaker/medical_notes*


In [None]:
## Delete all the content in the emr-mtSample folder. Check S3 before deleting it
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
bucket.objects.filter(Prefix=bucket_prefix).delete()

In [None]:
### Delete all the content in the PDF folder 

bucket.objects.filter(Prefix=PDFprefix).delete()

### Best Practice:
 1. Delete the bucket if this is the last lab of your workshop
 2. Shut down your notebook instance if you are not planning to explore more labs