# Predict Hospital Spending Per Patient with SageMaker Autopilot
In this lab we'll get started with SageMaker using Autopilot! In particular we will download the Medicare dataset, clean it, and plug it into a framework for SageMaker Autopilot.

You'll see the notebooks Autopilot generates for you, the candidate models it trains (this lab caps the job at 20), and your very own inference pipeline, deployable to a SageMaker endpoint or batch transform job!

At the end, we'll set up a SHAP explainer to analyze local feature importance for a set of predictions. Let's get started!

In [None]:
# Download the Medicare dataset as a CSV file to the notebook instance.
# The URL is quoted so the shell does not treat "?" as a wildcard.
!wget -O Medicare_Hospital_Spending_by_Claim.csv "https://data.medicare.gov/api/views/nrth-mfg3/rows.csv?accessType=DOWNLOAD"

### Data Preprocessing on the Raw Dataset
In this section we read the raw CSV dataset into a pandas DataFrame and inspect it with the `head()` method. We then pre-process the data: encode categorical features, convert percentage strings to numbers, rename columns, drop columns that have no bearing on predicting the `Avg_Hosp` cost, and confirm that the dataset has no missing values.

In [None]:
# Read the CSV file into a pandas DataFrame and copy it, so we keep the original dataset intact.
# In our example the DataFrame table1 is used for all pre-processing, while the DataFrame table
# keeps a copy of the original data

import pandas as pd
table = pd.read_csv('Medicare_Hospital_Spending_by_Claim.csv')
table1 = table.copy()
table1.head()

In [None]:
# Encode column "State"

replace_map = {'State': {'AK': 1, 'AL': 2, 'AR': 3, 'AZ': 4, 'CA': 5, 'CO': 6, 'CT': 7, 
 'DC': 8, 'DE': 9, 'FL': 10, 'GA': 11, 'HI': 12, 
 'IA': 13, 'ID': 14, 'IL': 15, 'IN': 16, 'KS': 17, 
 'KY': 18, 'LA': 19, 'MA': 20, 'ME': 21, 'MI': 22, 
 'MN': 23, 'MO': 24, 'MS': 25, 'MT': 26, 'NC': 27, 
 'ND': 28, 'NE': 29, 'NH': 30, 'NJ': 31, 'NM': 32, 
 'NV': 33, 'NY': 34, 'OH': 35, 'OK': 36, 'OR': 37, 
 'PA': 38, 'RI': 39, 'SC': 40, 'SD': 41, 'TN': 42, 
 'TX': 43, 'UT': 44, 'VA': 45, 'VT': 46, 'WA': 47, 
 'WI': 48, 'WV': 49, 'WY': 50}}
table1.replace(replace_map,inplace=True)
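
One thing to keep in mind with the dictionary approach: `replace` leaves any value that is not in the map untouched, so a state or territory code missing from the map would stay a string. The sketch below shows one way to spot such values. It uses a toy DataFrame and a three-entry excerpt of the map; both are made up for illustration, and whether the real dataset contains unmapped codes is something to check rather than assume.

```python
import pandas as pd

# Toy stand-in for the "State" column; "PR" is deliberately absent from the map.
toy = pd.DataFrame({"State": ["AK", "TX", "PR", "AL"]})
state_codes = {"AK": 1, "AL": 2, "TX": 43}  # excerpt of the full mapping

# Series.map returns NaN for any value missing from the dictionary,
# which makes unmapped categories easy to list.
encoded = toy["State"].map(state_codes)
unmapped = sorted(toy.loc[encoded.isna(), "State"].unique())
print(unmapped)
```

If the list is non-empty, you can either extend the map or drop those rows before encoding.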

In [None]:
# Encode column "Period"

replace_map = {'Period': {'1 to 3 days Prior to Index Hospital Admission': 1, 
 'During Index Hospital Admission': 2, 
 '1 through 30 days After Discharge from Index Hospital Admission': 3, 
 'Complete Episode': 4}}
table1.replace(replace_map,inplace=True)

In [None]:
# Encode column "Claim Type"

replace_map = {'Claim Type': {'Home Health Agency': 1, 
 'Hospice': 2, 
 'Inpatient': 3, 
 'Outpatient': 4, 
 'Skilled Nursing Facility': 5, 
 'Durable Medical Equipment': 6, 
 'Carrier': 7, 
 'Total': 8}}
table1.replace(replace_map,inplace=True)

In [None]:
# Convert the column "Percent of Spending Hospital" to float: remove the percent sign and
# divide by 100 to express the percentage as a fraction

table1['Percent of Spending Hospital'] = table1['Percent of Spending Hospital'].str.rstrip('%').astype('float')
table1['Percent of Spending Hospital'] = table1['Percent of Spending Hospital']/100

In [None]:
# Convert the column "Percent of Spending State" to float, remove the percent sign and 
# divide by 100 to normalize for percentage

table1['Percent of Spending State'] = table1['Percent of Spending State'].str.rstrip('%').astype('float')
table1['Percent of Spending State'] = table1['Percent of Spending State']/100

In [None]:
# Convert the column "Percent of Spending Nation" to float, remove the percent sign and 
# divide by 100 to normalize for percentage

table1['Percent of Spending Nation'] = table1['Percent of Spending Nation'].str.rstrip('%').astype('float')
table1['Percent of Spending Nation'] = table1['Percent of Spending Nation']/100
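
The three percentage cells above repeat the same two steps. As a minimal sketch, the conversion can be wrapped in a helper and applied in a loop. The helper name `percent_to_fraction` and the toy values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

def percent_to_fraction(series):
    """Turn strings such as '12.5%' into floats such as 0.125."""
    return series.str.rstrip('%').astype('float') / 100

# Toy rows that mimic the format of the percentage columns
toy = pd.DataFrame({
    'Percent of Spending Hospital': ['0.12%', '45.00%'],
    'Percent of Spending State': ['1.50%', '50.00%'],
    'Percent of Spending Nation': ['2.25%', '100.00%'],
})

for col in toy.columns:
    toy[col] = percent_to_fraction(toy[col])

print(toy)
```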

In [None]:
# Drop the column "Facility Name". The "Facility ID" column already identifies the facility,
# so the name adds nothing for the model

table1.drop(['Facility Name'], axis=1, inplace = True)

In [None]:
# Move the "Avg Spending Per Episode Hospital" column to the front. Autopilot finds the target by name
# (target_attribute_name), so this is a convenience that matches the target-first layout of SageMaker built-in algorithms

col_name='Avg Spending Per Episode Hospital'
first_col = table1.pop(col_name)
table1.insert(0, col_name, first_col)

In [None]:
# Convert integer values to float in the columns "Avg Spending Per Episode Hospital",
# "Avg Spending Per Episode State" and "Avg Spending Per Episode Nation".
# Columns with integer values can be interpreted as categorical. Changing to float avoids any misinterpretation

table1['Avg Spending Per Episode Hospital'] = table1['Avg Spending Per Episode Hospital'].astype('float')
table1['Avg Spending Per Episode State'] = table1['Avg Spending Per Episode State'].astype('float')
table1['Avg Spending Per Episode Nation'] = table1['Avg Spending Per Episode Nation'].astype('float')

In [None]:
# Rename long column names for costs and percentage costs on the hospital, state and nation,
# so they are easily referenced in the rest of this discussion

table1.rename(columns={'Avg Spending Per Episode Hospital':'Avg_Hosp',
 'Avg Spending Per Episode State':'Avg_State',
 'Avg Spending Per Episode Nation':'Avg_Nation',
 'Percent of Spending Hospital':'Percent_Hosp',
 'Percent of Spending State':'Percent_State',
 'Percent of Spending Nation':'Percent_Nation'}, 
 inplace=True)

In [None]:
# Convert Start Date and End Date to pandas datetime objects, then to integers. The year, month and day
# are extracted and weighted as 10000*year + 100*month + day, which yields a YYYYMMDD integer
# (a year weight of 1000 would let the month term spill into the year digits).

table1['Start Date'] = pd.to_datetime(table1['Start Date'])
table1['End Date'] = pd.to_datetime(table1['End Date'])
table1['Start Date'] = 10000*table1['Start Date'].dt.year + 100*table1['Start Date'].dt.month + table1['Start Date'].dt.day
table1['End Date'] = 10000*table1['End Date'].dt.year + 100*table1['End Date'].dt.month + table1['End Date'].dt.day
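
The choice of weights matters when dates become integers: the encoding keeps chronological order only if the year weight is larger than the biggest possible month-plus-day term. The sketch below compares two year weights on a pair of made-up dates that straddle a year boundary. The helper `encode` is hypothetical and exists only for this comparison.

```python
import pandas as pd

# Two toy dates on either side of a year boundary
dates = pd.Series(pd.to_datetime(['2018-12-31', '2019-01-01']))

def encode(dates, year_weight):
    """Weighted sum of year, month and day, as in the cell above."""
    return year_weight * dates.dt.year + 100 * dates.dt.month + dates.dt.day

print(encode(dates, 1000).tolist())   # [2019231, 2019101] -> order is inverted
print(encode(dates, 10000).tolist())  # [20181231, 20190101] -> order is preserved
```

With a year weight of 10000 the result reads as YYYYMMDD, which is also easier to eyeball in `head()`.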

In [None]:
# See the first 5 rows in the dataframe to see how the changed data looks

table1.head()

In [None]:
# Drop the columns "Start Date" and "End Date". The dataset only covers 2018, so the start and end dates
# are the same in every row and do not affect the model

table1.drop(['Start Date'], axis=1, inplace = True)
table1.drop(['End Date'], axis=1, inplace = True)

In [None]:
# Make sure the table does not have missing values. The following line counts the missing values
# per column; every count should be zero

table1.isna().sum()
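
If you would rather have the notebook tell you which columns have gaps instead of reading the counts by eye, a small filter on the same `isna().sum()` result does it. This is a sketch on a toy DataFrame with one deliberately missing value; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value
toy = pd.DataFrame({'Avg_Hosp': [1.0, np.nan], 'Period': [1, 2]})

missing = toy.isna().sum()                               # missing count per column
columns_with_gaps = missing[missing > 0].index.tolist()  # keep only columns with gaps
print(columns_with_gaps)
```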

In [None]:
# Shuffle all rows so that the train/test split below is random
df = table1.sample(frac=1)

In [None]:
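# Keep the first 85% of the shuffled rows for training and the remaining 15% for testing
fraction_train = .85
test_row = round(df.shape[0] * fraction_train)  # index of the first test row
test_set = df.iloc[test_row:]
train_set = df.iloc[:test_row]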
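
The shuffle above draws a new random order on every run, so the train and test sets change each time the notebook is executed. If you want a repeatable split, one option is to pass a seed to `sample`. The sketch below shows this on a toy DataFrame; the seed value 42 and the names `train_part` and `test_part` are arbitrary choices for illustration.

```python
import pandas as pd

# Toy data standing in for the pre-processed table
toy = pd.DataFrame({'Avg_Hosp': range(20), 'State': range(20)})

# A fixed random_state makes the shuffle, and therefore the split, repeatable
shuffled = toy.sample(frac=1, random_state=42)

split_row = round(len(shuffled) * 0.85)   # index of the first test row
train_part = shuffled.iloc[:split_row]
test_part = shuffled.iloc[split_row:]

print(len(train_part), len(test_part))
```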

In [None]:
local_train_file = 'train_set.csv'

train_set.to_csv(local_train_file, index=False, header=True)
test_set.to_csv('test_set.csv', index=False, header=True)

In [None]:
# optionally run some of your own plots here to analyze the data

# SageMaker Autopilot
Next, let's run this dataset on SageMaker Autopilot! 

In [None]:
from sagemaker import AutoML
from time import gmtime, strftime, sleep
import numpy as np
import sagemaker

sess = sagemaker.Session()

role = sagemaker.get_execution_role()

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
base_job_name = 'cost-prediction-' + timestamp_suffix

target_attribute_name = 'Avg_Hosp'
# Avg_Hosp is a continuous regression target, so there is no positive-class label to derive here

automl = AutoML(role=role,
 target_attribute_name=target_attribute_name,
 base_job_name=base_job_name,
 sagemaker_session=sess,
 max_candidates=20,
 problem_type = 'Regression',
 job_objective = {'MetricName':'MSE'})

automl.fit(local_train_file, job_name=base_job_name, wait=True, logs=True)

After you run this cell, open up the Experiments tab on SageMaker Studio, right click on your new `cost-prediction` job, and view the AutoML job details! 

![](../../Images/Autopilot.png)

Once the state of the job has moved into `Feature Engineering`, you should be able to open the data exploration notebook, in addition to the candidate generation notebook. 

Spend some time stepping through these notebooks. You can also download the data transformation code base. Remember, all of this was generated for your specific dataset!

---
# Analyze Autopilot Modeling Performance
Your AutoML job will take some time to complete. Feel free to use that time to step through the generated notebooks and learn about all the feature engineering strategies they are using! 

Once your job has finished, it's time to analyze that performance. Luckily for us we can simply deploy that entire artifact onto an endpoint, using the same `model.deploy()` that we saw earlier. Let's do that here.

We'll attach the name of your job to an AutoML estimator, so please make sure to paste in the name of your job below.

In [None]:
from datetime import datetime
from sagemaker import AutoML
import sagemaker
import numpy as np

sess = sagemaker.Session()

# if you needed to restart your kernel, you can attach your AutoML job here
automl_job_name = 'COST-PREDICTION-28-02-12-32' #<== REPLACE THIS WITH YOUR OWN AUTOML JOB NAME
automl = AutoML.attach(automl_job_name, sagemaker_session=sess)

ep_name = 'automl-endpoint-' + datetime.now().strftime('%S')

inference_response_keys = ['predicted_label', 'probability']

# Create the inference endpoint
automl.deploy(1, 'ml.m5.xlarge', endpoint_name = ep_name) #inference_response_keys=inference_response_keys)

In [None]:
!pip install --upgrade sagemaker

In [None]:
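from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import BytesDeserializer

# RealTimePredictor was renamed to Predictor in version 2 of the SageMaker Python SDK;
# content types are now set through the serializer and deserializer
class AutomlEstimator:
    def __init__(self, endpoint_name, sagemaker_session):
        self.predictor = Predictor(
            endpoint_name=endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=CSVSerializer(),                       # request body sent as text/csv
            deserializer=BytesDeserializer(accept='text/csv')  # raw bytes back, asking for text/csv
        )

    # Prediction function for regression
    def predict(self, x):
        response = self.predictor.predict(x)
        # Accept comma- or newline-separated values so that multi-row responses parse too
        values = response.decode('utf-8').replace('\n', ',').split(',')
        return np.array([float(v) for v in values if v.strip()])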

In [None]:
# make sure this is pointing to the right endpoint name - if you reran that cell above you may have overwritten the variable in memory
automl_estimator = AutomlEstimator(endpoint_name=ep_name, sagemaker_session=sess)

In [None]:
import pandas as pd

test_data = pd.read_csv('test_set.csv')
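
With the held-out test set loaded, a natural check is how far the endpoint's predictions land from the true `Avg_Hosp` values. The Autopilot job above optimizes MSE, so the sketch below computes MSE together with RMSE and MAE. It runs on made-up numbers; in the lab, `y_pred` would come from the endpoint (for example via `automl_estimator.predict` on the test rows without the target column), which is an assumption and not executed here. The helper name `regression_report` is hypothetical.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    mse = float(np.mean(errors ** 2))
    return {'MSE': mse, 'RMSE': mse ** 0.5, 'MAE': float(np.mean(np.abs(errors)))}

# Toy values standing in for true costs and endpoint predictions
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

report = regression_report(y_true, y_pred)
print(report)
```

RMSE and MAE are in the same units as the target (dollars per episode), which usually makes them easier to explain to stakeholders than MSE.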

# Explain Global and Local Modeling Performance with SHAP
A key question that many stakeholders will have is how your model came to its predictions, both for the entire dataset and for individual predictions. In this lab we'll set up a SHAP model explainer to view feature importances. Feature importances can be understood both as "local," or per-prediction, and as "global," or for the entire dataset.

We will actually wrap your model endpoint to provide these.
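
To build some intuition for what a local explanation is, here is a minimal sketch that computes exact Shapley values for a tiny, made-up model with three features. Each feature's value is its marginal contribution averaged over every order in which the features could be switched from a baseline to their actual values. The toy `model`, the all-zero baseline and the helper `exact_shapley` are assumptions for illustration; KernelSHAP, which we use below, approximates this idea by sampling and uses a background dataset instead of a single baseline.

```python
from itertools import permutations
import numpy as np

def model(x):
    # Hypothetical toy model: two linear terms plus one interaction
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]

def exact_shapley(f, x, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(x)
    phi = np.zeros(n)
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = np.array(baseline, dtype=float)
        previous_output = f(current)
        for i in order:
            current[i] = x[i]  # switch feature i from the baseline to its actual value
            output = f(current)
            phi[i] += output - previous_output
            previous_output = output
    return phi / len(orderings)

x = np.array([1.0, 2.0, 4.0])
baseline = np.zeros(3)
phi = exact_shapley(model, x, baseline)

print(phi)                                    # per-feature contributions
print(phi.sum(), model(x) - model(baseline))  # the contributions add up to the prediction gap
```

The last line shows the property that makes SHAP plots readable: the base value plus the per-feature contributions equals the model's prediction for that row.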

In [None]:
!conda update -n base -c defaults conda -y

In [None]:
!conda install -c conda-forge -y shap

In [None]:
import shap

from shap import KernelExplainer
from shap import sample
from scipy.special import expit

# Initialize plugin to make plots interactive.
shap.initjs()

In [None]:
data_without_target = test_data.drop(columns=['Avg_Hosp'])

background_data = sample(data_without_target, 50)

In [None]:
# Derive link function 
problem_type = automl.describe_auto_ml_job(job_name=automl_job_name)['ResolvedAttributes']['ProblemType'] 
link = "identity" if problem_type == 'Regression' else "logit" 

# Pass the endpoint-backed predict function to KernelExplainer; for regression it returns the predicted value directly
explainer = KernelExplainer(automl_estimator.predict, background_data, link=link)
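
The `link` argument decides the space in which SHAP values are additive. As a rough sketch in plain NumPy: the logit link maps a probability to log-odds and `expit` maps it back, which is what a classification explainer would need, while the identity link used for regression leaves the predicted cost unchanged. The function definitions below are written out for illustration (`scipy.special.expit`, imported earlier, computes the same inverse), and the numbers are made up.

```python
import numpy as np

def logit(p):
    """Probability -> log-odds."""
    return np.log(p / (1.0 - p))

def expit(z):
    """Log-odds -> probability (inverse of logit)."""
    return 1.0 / (1.0 + np.exp(-z))

# Classification case: explanations live in log-odds space
probability = 0.8
log_odds = logit(probability)
recovered = expit(log_odds)

# Regression case: the identity link leaves the prediction as it is
predicted_cost = 1234.5
identity_output = predicted_cost

print(log_odds, recovered, identity_output)
```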

In [None]:
# expected_value is the average prediction over the background data; with the identity link it is already in the target's units
print('expected value =', explainer.expected_value)

In [None]:
%%writefile managed_endpoint.py

import boto3
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker',region_name=region)

class ManagedEndpoint:
    def __init__(self, ep_name, auto_delete=False):
        self.name = ep_name
        self.auto_delete = auto_delete
        self.in_service = False  # updated in __enter__ once the endpoint status is known

    def __enter__(self):
        endpoint_description = sm.describe_endpoint(EndpointName=self.name)
        if endpoint_description['EndpointStatus'] == 'InService':
            self.in_service = True
        return self  # so that "with ... as mep" binds this object

    def __exit__(self, type, value, traceback):
        if self.in_service and self.auto_delete:
            print("Deleting the endpoint: {}".format(self.name))
            sm.delete_endpoint(EndpointName=self.name)
            sm.get_waiter('endpoint_deleted').wait(EndpointName=self.name)
            self.in_service = False
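
Why wrap the endpoint in a context manager at all? Because `__exit__` runs even when the code inside the `with` block fails, so a costly endpoint is not left running after an error. The sketch below demonstrates this with a fake client that only records calls. `FakeSageMakerClient` and `ManagedEndpointSketch` are hypothetical stand-ins written for this demonstration; nothing here talks to AWS.

```python
class FakeSageMakerClient:
    """Hypothetical stand-in for the boto3 SageMaker client; it only records calls."""
    def __init__(self):
        self.deleted = []

    def describe_endpoint(self, EndpointName):
        return {'EndpointStatus': 'InService'}

    def delete_endpoint(self, EndpointName):
        self.deleted.append(EndpointName)


class ManagedEndpointSketch:
    def __init__(self, client, ep_name, auto_delete=False):
        self.client = client
        self.name = ep_name
        self.auto_delete = auto_delete
        self.in_service = False

    def __enter__(self):
        status = self.client.describe_endpoint(EndpointName=self.name)['EndpointStatus']
        self.in_service = (status == 'InService')
        return self

    def __exit__(self, exc_type, exc_value, tb):
        if self.in_service and self.auto_delete:
            self.client.delete_endpoint(EndpointName=self.name)
            self.in_service = False
        return False  # do not swallow exceptions raised inside the with block


client = FakeSageMakerClient()
try:
    with ManagedEndpointSketch(client, 'demo-endpoint', auto_delete=True):
        raise RuntimeError('simulated failure while computing SHAP values')
except RuntimeError:
    pass

print(client.deleted)  # the endpoint was still cleaned up
```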

In [None]:
# Get the first sample
x = data_without_target.iloc[0:1]

# ManagedEndpoint can optionally auto delete the endpoint after calculating the SHAP values.
# To enable auto delete, use ManagedEndpoint(ep_name, auto_delete=True)
from managed_endpoint import ManagedEndpoint
with ManagedEndpoint(ep_name) as mep:
    shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='aic')

# Visualize SHAP Values
Now, let's see which features are more strongly influencing the predictions from our model!

![](images/shap_1.png)

In [None]:
# With the identity link used for regression, shap_values are already in the units of the target, so no conversion is needed
shap.force_plot(explainer.expected_value, shap_values, x, link=link)

![](images/shap_2.png)

In [None]:
# Repeat the explanation, this time limiting it to the 5 most important features
with ManagedEndpoint(ep_name) as mep:
    shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='num_features(5)')
shap.force_plot(explainer.expected_value, shap_values, x, link=link)

In [None]:
# Draw 50 random rows from the test data
X = sample(data_without_target, 50)

# Calculate SHAP values for these samples, and delete the endpoint
with ManagedEndpoint(ep_name, auto_delete=True) as mep:
    shap_values = explainer.shap_values(X, nsamples='auto', l1_reg='aic')

![](images/shap_3.png)

In [None]:
shap.force_plot(explainer.expected_value, shap_values, X, link=link)

![](images/shap_4.png)

In [None]:
shap.summary_plot(shap_values, X, plot_type="bar")
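
The bar version of the summary plot ranks features by their mean absolute SHAP value, which is how the per-row ("local") values roll up into a "global" importance. Here is a minimal sketch of that aggregation on a small, made-up SHAP matrix. The numbers and the three-feature subset are for illustration only.

```python
import numpy as np

feature_names = ['Avg_State', 'Claim Type', 'Period']  # illustrative subset of columns

# Toy SHAP matrix: one row per explained sample, one column per feature
toy_shap_values = np.array([
    [120.0, -30.0,  5.0],
    [-80.0,  60.0, -5.0],
    [100.0, -90.0,  0.0],
])

# Global importance = mean of the absolute per-row contributions
global_importance = np.abs(toy_shap_values).mean(axis=0)
ranking = sorted(zip(feature_names, global_importance), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(name, round(float(score), 2))
```

Taking absolute values first matters: a feature that pushes some predictions up and others down would otherwise average out to zero and look unimportant.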

---
# Optional - Extend Autopilot with your own feature engineering code
If you have extra time after getting to the local inference explanations, why not take a look at bringing your own feature engineering code into SageMaker Autopilot? Remember that this notebook started with ~10 basic ETL steps in Python to convert the raw Medicare data into something our models could even start to look at. Look at the following example to see how to port your own ETL scripts into SageMaker Autopilot for custom feature engineering.

Remember, once you get the entire pipeline deployed onto an endpoint, you can send the raw data up to the endpoint, and it will perform both feature engineering and model inference for you, all in real time!

- https://github.com/aws/amazon-sagemaker-examples/tree/master/autopilot/custom-feature-selection